Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning

ArXi:2603.26365v1 Announce Type: new Multimodal Large Language Models have nstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from ''context rot'' due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness.