AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

arXiv:2603.28696v1 Announce Type: cross

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) because of high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames or tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a