AI RESEARCH

Targeted Neuron Modulation via Contrastive Pair Search

arXiv CS.LG

ArXi:2605.12290v1 Announce Type: new Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We