CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling

arXiv:2602.13191v2

Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which often misses both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, processing a full image and its tokens for every frame incurs substantial computational overhead.