MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

arXiv:2604.12537v1 Announce Type: cross

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches assign positional indices uniformly to all tokens, overlooking variations in information density within and across modalities. This leads to inefficient attention allocation: redundant visual regions dominate while informative content is underrepresented.