QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression

arXiv:2603.21232v1 Announce Type: cross Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios.