When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

arXiv:2509.00544v4

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are