Fine-tuning BLIP2 for Prompt-instructed Video Classification

Image generated using Gemini's Nano Banana Pro

Video understanding remains one of the most challenging frontiers in computer vision. Unlike static images, videos exhibit rich temporal dynamics, including human actions, object interactions, and scene transitions. Conventional video classification methods rely on architectures such as 3D CNNs and video transformers (TimeSformer, ViViT). These methods are effective, but they cannot incorporate natural-language guidance into the classification process.