LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs

ArXi:2605.11477v1 Announce Type: new Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or