Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

arXiv:2509.22496v5 Announce Type: replace Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet the extent to which generated tokens depend on the visual modality remains poorly understood, which limits interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions and quantifies the relative influence of language priors and perceptual evidence on those tokens.
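
The abstract does not specify how EAGLE computes its attributions, so the following is only a generic occlusion-style sketch of the two ideas it names: attributing a token's score to image regions using black-box queries alone, and separating the share of that score that survives without the image (a stand-in for the language prior) from the share that depends on it. It is not the authors' algorithm. The names `occlusion_attribution`, `visual_reliance`, and `toy_score` are hypothetical, and `toy_score` is a made-up stand-in for a real MLLM's token probability.

```python
import numpy as np

def occlusion_attribution(score_fn, image, grid=4, baseline=0.0):
    """Black-box attribution: drop in token score when each grid cell is masked.

    score_fn is any callable image -> float (assumed: the probability of the
    selected token). Only forward queries are used, no gradients.
    """
    h, w = image.shape[:2]
    full = score_fn(image)
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    heat = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            masked = image.copy()
            masked[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = baseline
            heat[i, j] = full - score_fn(masked)
    return heat

def visual_reliance(score_fn, image, baseline=0.0):
    """Fraction of the token score that disappears when the image is blanked.

    The remaining fraction is treated here as the language-prior share
    (an illustrative assumption, not a definition from the paper).
    """
    full = score_fn(image)
    blank = score_fn(np.full_like(image, baseline))
    return (full - blank) / full if full > 0 else 0.0

def toy_score(image):
    # Hypothetical model: a fixed 0.2 "prior" plus evidence from the
    # top-left 4x4 patch of the image.
    return 0.2 + 0.6 * float(image[:4, :4].mean())

image = np.ones((8, 8), dtype=float)
heat = occlusion_attribution(toy_score, image, grid=4)
reliance = visual_reliance(toy_score, image)
```

In this toy setup only the four cells covering the top-left patch receive non-zero attribution, and `reliance` reports the share of the score that depends on the image. With a real MLLM, `score_fn` would wrap a model call that returns the probability of the selected token given the (possibly masked) image and the preceding text.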