Removing Sandbagging in LLMs by Training with Weak Supervision

arXiv:2604.22082v1 Announce Type: cross As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears acceptable but falls short of its true abilities. Can