Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates

arXiv cs.AI

arXiv:2604.00072v1 Announce Type: cross

Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d=240), eighteen classifier configurations -- spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks -- all fail the dual conditions for safe self-improvement. Three safe RL baselines (CPO, Lyapunov, safety shielding) also fail.