How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

arXiv:2605.16359v1 Announce Type: new Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing