I am working on a video summarization task and want to extract relevant still frames from the video.
The idea is: the user describes their interests, and the AI summarizes the video as text and complements the text summary with relevant images extracted from the video.
Examples:
"how to grill a steak" - images of the BBQ, unpacking, seasoning, temperature measurement, flare-ups, and then the final result.
"top investment advisers" - face shots of the top advisers, snapshots of charts of their performance, etc.
I am looking for ideas on approaches to accomplish this.
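
To make the frame-selection step concrete, here is a minimal sketch of one possible framing, not a definitive approach. It assumes each sampled frame and the user's query have already been mapped to vectors in a shared space by some joint text-image embedding model (which model is left open here). The function names `cosine` and `select_frames`, and the parameters `k` and `min_gap`, are hypothetical and only for illustration. Frames are ranked by similarity to the query, and near-duplicate picks are avoided by requiring a minimum time gap between selected frames.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_frames(query_vec, frame_vecs, timestamps, k=3, min_gap=2.0):
    """Pick up to k frame indices most similar to the query.

    Assumption: query_vec and frame_vecs come from the same embedding
    model, so their similarity is meaningful. A frame is skipped if it
    lies within min_gap seconds of an already selected frame.
    """
    # Rank frames from most to least similar to the query.
    scored = sorted(
        ((cosine(query_vec, v), t, i)
         for i, (v, t) in enumerate(zip(frame_vecs, timestamps))),
        reverse=True,
    )
    chosen = []
    for _score, t, i in scored:
        if all(abs(t - timestamps[j]) >= min_gap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    # Return the picks in video order so they can follow the text summary.
    return sorted(chosen, key=lambda i: timestamps[i])

# Toy data: 2-D vectors stand in for real embeddings.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.6]]
times = [0.0, 1.0, 5.0, 10.0]
picked = select_frames([1.0, 0.0], frames, times, k=2, min_gap=2.0)
```

In the toy data, the frame at 1.0 s is nearly as similar to the query as the frame at 0.0 s, but it is skipped because it falls inside the time gap, which is the behaviour wanted for avoiding several near-identical shots of the same moment.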