AI RESEARCH
Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness
arXiv cs.LG
arXiv:2506.24056v2 Announce Type: replace-cross RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We