Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

arXiv:2506.24056v2 Announce Type: replace-cross RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We