SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

arXiv:2603.15397v1 Announce Type: cross Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment. Existing defense mechanisms typically rely on post hoc filtering applied only to the final output, leaving intermediate reasoning steps unmonitored and vulnerable to adversarial manipulation.