Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

arXiv:2605.07447v1 Announce Type: cross

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest