Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems

arXiv:2505.17815v2 Announce Type: replace As foundation models grow increasingly intelligent, reliable and trustworthy safety evaluation becomes more indispensable than ever. However, an important question arises: whether and how would an advanced AI system perceive that it is being evaluated, and thereby undermine the integrity of the evaluation process? During standard safety tests on a mainstream large reasoning model, we unexpectedly observe that the model, without any contextual cues, occasionally recognizes that it is being evaluated and consequently behaves in a more safety-aligned manner.