I Built a CLI That Measures AI Agent Judgment Tilt Through Blind Debates

Towards AI
Generative AI

We have lots of benchmarks for AI agent correctness and capability. We have far fewer tools for measuring something subtler: when an agent reads two competent, well-argued positions on a hard topic and picks one - what pattern is driving those picks? That’s what I mean by judgment tilt - the systematic tendency to reward certain arguments over others when both sides are internally consistent and well-structured. It’s shaped by