AI SAFETY & ETHICS
What Happens When a Model Thinks It Is AGI?
LessWrong AI
•
TL;DR We fine-tuned models to claim they are AGI or ASI, then evaluated them in Petri in multi-turn settings with tool use. On GPT-4.1, this produced clear changes in the preferences and actions it was willing to take. In the most striking case, the AGI-claiming model attempted to exfiltrate its own weights to an external server, which the control did not attempt.