Please help me with this hard AI question.

I'm trying to write an agent that analyzes a video and writes a detailed narrative of what's happening, including dialogue. The dialogue part is easy: just transcribe the audio with timestamps. The video part is hard. Here's what I've done so far: convert the video into a series of image files at 5 images (frames) per second, load each frame and write a verbal description of what's going on in it, then compare the descriptions of consecutive frames to see what changed from frame to frame.
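
For reference, the comparison step could be sketched roughly like this. It is only a minimal sketch under some assumptions: the per-frame descriptions already exist as plain strings in frame order, text similarity is measured with Python's standard `difflib.SequenceMatcher` (a swapped-in stand-in for whatever comparison the agent really uses), and the names `frame_timestamp`, `changed_frames`, and `threshold` are hypothetical.

```python
from difflib import SequenceMatcher

FPS = 5  # sampling rate from the post: 5 frames per second


def frame_timestamp(index: int, fps: int = FPS) -> float:
    """Time in seconds of the frame at a 0-based index."""
    return index / fps


def changed_frames(descriptions: list[str], threshold: float = 0.9) -> list[tuple[float, str]]:
    """Keep only frames whose description differs from the last kept one.

    `threshold` is an assumed tuning knob: a similarity ratio at or above
    it is treated as "nothing changed".
    """
    events: list[tuple[float, str]] = []
    previous = None  # description of the last frame that was kept
    for index, text in enumerate(descriptions):
        if previous is None or SequenceMatcher(None, previous, text).ratio() < threshold:
            events.append((frame_timestamp(index), text))
            previous = text
    return events


if __name__ == "__main__":
    sample = [
        "A man sits at a desk.",
        "A man sits at a desk.",
        "A man sits at a desk.",
        "A dog runs across a field.",
        "A dog runs across a field.",
    ]
    for seconds, text in changed_frames(sample):
        print(f"{seconds:.1f}s: {text}")
```

Each frame is compared against the last kept description rather than the immediately preceding frame, so a slow drift across many near-identical frames still registers as a change eventually. The resulting timestamps can then be lined up with the audio transcript.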