Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

arXiv:2603.05772v1 Announce Type: cross Open-sourced large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their structure and weights are public, they remain exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components. This creates a false sense of security that defenses are succeeding.