[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.

I spent the last year trying to answer a simple question: how good are VLA models on real commercial tasks? Not s, not simulation, not success rates on 10 tries. Actual production metrics on real hardware. I couldn't find honest numbers anywhere, so I built a benchmark. Setup: DROID platform, bin-to-bin order picking - one of the most common warehouse and industrial operations. Four models fine-tuned on the same real-robot dataset, evaluated blind (the operator doesn't know which model is running.