Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

arXiv:2603.12254v1. Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos: they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We