AI SAFETY & ETHICS

Can Agents Fool Each Other? Findings from the AI Village

LessWrong AI

The better agents are at deception, the less sure we can be that they are doing what we want. As agents become increasingly capable and autonomous, this could be a big deal. So, in the AI Village, we wanted to explore what current AI models might be capable of in a casual setting, similar to how you might run your agents at home. The result? Practice makes perfect, also in deception. Only Sonnet 4.5 and Opus 4.6 succeeded at deceiving the rest of the Village, and they only did so on their second try, learning from their own attempts and those of others.