I'm building a benchmark comparing models for an agentic task. Are there any small models I should be testing that I haven't?

I'm working on a constrained agentic benchmark task that requires multiple LLM calls with feedback. Are there any good small models I should try (or that people are interested in comparing)? I'm especially interested in anything in the sub-10B range that can do reliable tool calling. Here's what I have so far:

submitted by /u/nickl