I tested as many of the small local and OpenRouter models I could with my own agentic text-to-SQL benchmark. Surprises ensured...

Last week I asked for some feedback about what extra models I should test. I've added them all and now the benchmark is available at I didn't say a lot about what the agent at the time, but in simple terms it takes an English query like " Show order lines, revenue, units sold, revenue per unit (total revenue ÷ total units sold), average list price per product in the subcategory, gross profit, and margin percentage for each product subcategory " and turns it into SQL that it tests against a set of database tables.