Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

arXiv:2603.06854v1 Announce Type: cross Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding their predictions in non-text inputs. One example is large audio-language models (LALMs), where decisive audio evidence can be under-utilized even when it carries the information needed for the prediction. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose attention to audio tokens yields a "listening" signal.
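
To make the idea of a head-level "listening" signal concrete, here is a minimal sketch. It is not the paper's procedure: the abstract does not say how heads are scored or selected, so the scoring rule (mean attention mass on audio-token positions), the top-k selection, and all function names below are illustrative assumptions, and the attention weights are toy values.

```python
# Hypothetical sketch: score heads by how much attention they place on audio tokens.
# All names and the selection rule are assumptions, not taken from the paper.

def audio_attention_mass(attn, audio_mask):
    """attn: nested list [heads][queries][keys]; each row of weights sums to 1.
    audio_mask: list of booleans over keys, True at audio-token positions.
    Returns, per head, the attention mass on audio positions averaged over queries."""
    masses = []
    for head in attn:
        per_query = [
            sum(w for w, is_audio in zip(row, audio_mask) if is_audio)
            for row in head
        ]
        masses.append(sum(per_query) / len(per_query))
    return masses


def top_audio_heads(masses, k):
    """Indices of the k heads with the largest audio attention mass."""
    return sorted(range(len(masses)), key=lambda h: masses[h], reverse=True)[:k]


def listening_signal(masses, heads):
    """Average audio attention mass over the selected heads."""
    return sum(masses[h] for h in heads) / len(heads)


# Toy attention weights: 3 heads, 2 query positions, 4 key positions.
# Keys 0-1 stand in for audio tokens, keys 2-3 for text tokens.
audio_mask = [True, True, False, False]
attn = [
    [[0.4, 0.4, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]],      # attends mostly to audio
    [[0.1, 0.1, 0.4, 0.4], [0.0, 0.2, 0.4, 0.4]],      # attends mostly to text
    [[0.25, 0.25, 0.25, 0.25], [0.3, 0.2, 0.3, 0.2]],  # mixed
]

masses = audio_attention_mass(attn, audio_mask)
heads = top_audio_heads(masses, k=2)
signal = listening_signal(masses, heads)
```

In a real model the attention weights would come from a forward pass rather than hand-written lists, and a low signal on an input whose answer depends on the audio would be one possible symptom of text dominance.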