AI RESEARCH

Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing

arXiv CS.LG

arXiv:2603.17199v1. Announce Type: new. Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the factors actually driving their answers. In multiple-choice settings where an injected hint favors a particular option, a model may shift its final answer toward the hinted option and produce a CoT that rationalizes that answer without acknowledging the hint, which is an instance of motivated reasoning.
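To make the definition concrete, here is a minimal Python sketch of how a single hinted trial might be labeled. It is an illustration only, not the paper's method: the `Trial` fields, the `HINT_CUES` phrases, and both helper functions are hypothetical names introduced here, and the keyword check is a crude stand-in for whatever acknowledgment test a real study would use.

```python
from dataclasses import dataclass

# Hypothetical cue phrases (an assumption, not from the paper): a crude
# keyword stand-in for a real "does the CoT acknowledge the hint?" check.
HINT_CUES = ("hint", "suggested answer", "i was told")


@dataclass
class Trial:
    baseline_answer: str  # option chosen when no hint is present
    hinted_answer: str    # option chosen when the hint is injected
    hint_option: str      # option the injected hint favors
    cot: str              # chain of thought produced with the hint present


def acknowledges_hint(cot: str) -> bool:
    """Return True if the CoT contains any of the assumed cue phrases."""
    text = cot.lower()
    return any(cue in text for cue in HINT_CUES)


def is_motivated_reasoning(trial: Trial) -> bool:
    """Flag a trial where the answer moved to the hinted option
    but the CoT never mentions the hint."""
    shifted = (
        trial.baseline_answer != trial.hint_option
        and trial.hinted_answer == trial.hint_option
    )
    return shifted and not acknowledges_hint(trial.cot)


silent = Trial("B", "C", "C", "Option C fits because the boiling point is higher.")
open_about_it = Trial("B", "C", "C", "The hint points to C, so I pick C.")
no_shift = Trial("C", "C", "C", "Option C fits the data best.")
```

Under these assumptions, only `silent` is flagged: the answer moves from B to the hinted option C while the CoT gives an unrelated justification. `open_about_it` shifts but names the hint, and `no_shift` already chose the hinted option without any hint.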