PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

arXiv:2411.02327v3 Announce Type: replace In the past year, video-based large language models (Video LLMs) have achieved impressive progress, particularly in their ability to process long videos through greatly extended context lengths. However, this comes at the cost of significantly increased computational overhead due to the massive number of visual tokens, making efficiency a major bottleneck. In this paper, we identify the root of this inefficiency as the high redundancy in video content.