Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

arXiv:2512.07222v3 Announce Type: replace To address the trade-off between robustness and performance in robust vision-language models (VLMs), we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and we accordingly propose Function-word De-Attention (FDA) to mitigate the impact of function words. Analogous to a differential amplifier, FDA computes both the original cross-attention and the function-word cross-attention within attention heads, and subtracts the latter from the former, yielding more aligned and robust VLMs.
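
The differential subtraction described above can be illustrated with a minimal single-head sketch. This is not the paper's implementation: the abstract does not specify how the function-word branch is formed or how the two branches are weighted, so the masking scheme, the scalar weight `lam`, and the names `fda_cross_attention` and `is_function_word` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def fda_cross_attention(q, k, v, is_function_word, lam=0.5):
    """Single-head cross-attention with a function-word branch subtracted.

    q: (n_query, d) queries, e.g. image patch features.
    k, v: (n_text, d) keys and values from text tokens.
    is_function_word: length-n_text booleans marking function words
        (e.g. "the", "of"); how tokens are tagged is left open here.
    lam: hypothetical subtraction weight; not taken from the paper.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)

    # Original branch: ordinary scaled dot-product cross-attention.
    original = softmax(scores) @ v

    mask = np.asarray(is_function_word, dtype=bool)
    if not mask.any():
        # No function words present, so there is nothing to subtract.
        return original

    # Function-word branch (assumed form): attention restricted to
    # function-word tokens by masking all other tokens to -inf.
    fw_scores = np.where(mask[None, :], scores, -np.inf)
    function_word = softmax(fw_scores) @ v

    # Differential step: subtract the function-word branch.
    return original - lam * function_word
```

With `lam=0` the sketch reduces to standard cross-attention, which makes it easy to compare the two behaviors on the same inputs. A learned or per-head weight would be a natural alternative to the fixed scalar used here.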