AI RESEARCH
Attention Is Where You Attack
arXiv CS.AI
arXiv:2605.00236v1 Announce Type: cross Safety-aligned large language models rely on RLHF and instruction tuning to refuse harmful requests, yet the internal mechanisms implementing safety behavior remain poorly understood. We