AI RESEARCH

Attention Is Where You Attack

arXiv CS.AI

arXiv:2605.00236v1 Announce Type: cross Safety-aligned large language models rely on RLHF and instruction tuning to refuse harmful requests, yet the internal mechanisms implementing safety behavior remain poorly understood. We