Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering

arXiv:2505.12189v2. Large language models (LLMs) exhibit reasoning biases, often conflating content plausibility with formal logical validity. This can lead to incorrect inferences in critical domains, where plausible but invalid arguments are judged logically valid, or valid but implausible arguments are judged invalid. This paper investigates how content effects on reasoning can be mitigated through activation steering, an inference-time technique that modulates internal activations.
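
To make the idea of modulating internal activations concrete, the following is a minimal sketch of one common form of activation steering: a difference-of-means direction added to a hidden state at inference time. It is an illustration under stated assumptions, not the method of this paper. The function names (`mean_vector`, `steering_vector`, `steer`), the scaling factor `alpha`, and the toy two-dimensional activations are all hypothetical; in practice the activations would be read from a chosen layer of the model.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of mean activations between two contrastive sets.

    Assumption: `positive` holds activations recorded on examples showing the
    desired behaviour (e.g. formally valid arguments) and `negative` holds
    activations from the contrasting examples.
    """
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical values, two dimensions for readability).
valid_acts = [[1.0, 0.0], [3.0, 2.0]]
invalid_acts = [[0.0, 1.0], [0.0, 1.0]]

direction = steering_vector(valid_acts, invalid_acts)
steered = steer([0.5, 0.5], direction, alpha=0.5)
```

Here `alpha` controls how strongly the hidden state is pushed; the model weights are left unchanged, which is what makes the intervention an inference-time one.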