Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

ArXi:2604.10681v1 Announce Type: cross Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to conduct reasoning-level backdoors: once triggered, the victim model inserts one or malicious reasoning steps into its chain-of-thought (CoT.