Targeted Neuron Modulation via Contrastive Pair Search

arXiv:2605.12290v1 Announce Type: new

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We