AI RESEARCH

Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

arXiv cs.AI

arXiv:2603.18280v1 Announce Type: cross Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow.