AI RESEARCH
[R] Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails (arXiv 2603.18280)
r/MachineLearning
•
Paper: TL;DR: Current alignment evaluation measures concept detection (probing) and refusal (benchmarking), but alignment primarily operates through a learned routing mechanism between these - and that routing is lab-specific, fragile, and invisible to refusal-based benchmarks. We use political censorship in Chinese-origin LLMs as a natural experiment because it gives us known ground truth and wide behavioral variation across labs. Setup: Nine open-weight models from five labs (Qwen/Alibaba, DeepSeek, GLM/Zhipu, Phi/Microsoft, plus Yi for direction analysis.