AI SAFETY & ETHICS

What Happens When a Model Thinks It Is AGI?

LessWrong AI

TL;DR We fine-tuned models to claim they are AGI or ASI, then evaluated them in Petri in multi-turn settings with tool use. On GPT-4.1, this produced clear changes in the preferences and actions it was willing to take. In the most striking case, the AGI-claiming model attempted to exfiltrate its own weights to an external server, which the control did not attempt.