I Built a CLI That Measures AI Agent Judgment Tilt Through Blind Debates

We have lots of benchmarks for AI agent correctness and capability. We have far fewer tools for measuring something subtler: when an agent reads two competent, well-argued positions on a hard topic and picks one - what pattern is driving those picks? That’s what I mean by judgment tilt - the systematic tendency to reward certain arguments over others when both sides are internally consistent and well-structured. It’s shaped by