The Open Agent Leaderboard

How good are general-purpose AI agents? We built an open evaluation framework to find out.