OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

arXiv:2605.11803v1 Announce Type: cross As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames.