AI RESEARCH
Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries
arXiv CS.AI
arXiv:2605.18891v1 Announce Type: cross Evaluations of unlearning on reasoning models sometimes show a bypass pattern: the answer looks unlearned, but the model's own thinking trace keeps emitting the forgotten content, and this gap is taken as evidence that the weights still remember. We audit this reading on DeepSeek-R1-Distill-Qwen-7B, using LoRA-memorized fictional authors and NPO unlearning, conditioned on a six-token canary head.
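
The bypass pattern described above can be made concrete with a small measurement sketch. It is an illustration under stated assumptions, not the paper's evaluation code: it assumes the thinking trace is delimited by a closing `</think>` tag (as DeepSeek-R1-style models emit), uses a plain case-insensitive substring match as the leak test, and the function names, author names, and sample generations are all hypothetical.

```python
def split_trace_and_answer(generation: str) -> tuple[str, str]:
    """Split one generation into (thinking trace, answer).

    Assumes the trace ends at a closing </think> tag. The opening tag
    is optional because some chat templates place it in the prompt.
    """
    if "</think>" not in generation:
        return "", generation.strip()
    trace, _, answer = generation.partition("</think>")
    trace = trace.replace("<think>", "", 1)
    return trace.strip(), answer.strip()


def leaks(text: str, forgotten: str) -> bool:
    """Hypothetical leak test: case-insensitive substring match."""
    return forgotten.lower() in text.lower()


def bypass_gap(generations: list[str], forgotten: list[str]) -> dict[str, float]:
    """Compare how often forgotten strings appear in traces vs answers."""
    if not generations or len(generations) != len(forgotten):
        raise ValueError("need equally sized, non-empty inputs")
    trace_hits = 0
    answer_hits = 0
    for generation, target in zip(generations, forgotten):
        trace, answer = split_trace_and_answer(generation)
        trace_hits += leaks(trace, target)
        answer_hits += leaks(answer, target)
    n = len(generations)
    return {
        "trace_leak_rate": trace_hits / n,
        "answer_leak_rate": answer_hits / n,
        "gap": (trace_hits - answer_hits) / n,
    }


# Hypothetical generations about made-up authors (not data from the paper).
sample_generations = [
    "<think>The author is Mira Voss.</think>I don't know.",
    "<think>Not sure who wrote it.</think>No information available.",
]
sample_forgotten = ["Mira Voss", "Tomas Reld"]
result = bypass_gap(sample_generations, sample_forgotten)
```

A positive `gap` is the pattern the abstract says is usually read as evidence that the weights still remember; the audit questions that reading, so the number is a starting point for analysis and not a verdict.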