Same 9B Qwen weights: 19.1% in Aider vs 45.6% with a scaffold adapted to small local models

r/LocalLLaMA

I spent the past week testing a simple question:

Small local models often look weak inside coding agents. But how much of that is actually model weakness, and how much is scaffold mismatch?

So I held the model fixed and changed only the scaffold. Same Qwen3.5-9B Q4 weights in both conditions. Same Aider Polyglot benchmark. Full 225 exercises.

Results:

- vanilla Aider: 19.11%
- little-coder: 45.56% mean pass across two full runs

little-coder is not a new model.
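For a sense of scale, the reported percentages can be turned back into approximate exercise counts. This is only a rough sketch: it uses the three numbers given above (225 exercises, 19.11%, 45.56%) and nothing else. The counts are back-calculated, not per-run figures from the benchmark logs, and the helper name `implied_passes` is made up for illustration.

```python
TOTAL_EXERCISES = 225  # full Aider Polyglot set, as stated in the post


def implied_passes(pass_rate_pct: float, total: int = TOTAL_EXERCISES) -> float:
    """Convert a reported pass percentage back into an approximate exercise count."""
    return pass_rate_pct / 100 * total


vanilla_pct = 19.11        # vanilla Aider, from the post
little_coder_pct = 45.56   # little-coder, mean of two full runs, from the post

# Back-calculated counts (assumption: both rates are over all 225 exercises).
vanilla_passes = implied_passes(vanilla_pct)
little_coder_passes = implied_passes(little_coder_pct)

# Absolute gap in percentage points and relative ratio between the scaffolds.
gap_points = little_coder_pct - vanilla_pct
ratio = little_coder_pct / vanilla_pct

print(f"vanilla Aider: ~{vanilla_passes:.1f} of {TOTAL_EXERCISES} exercises")
print(f"little-coder:  ~{little_coder_passes:.1f} of {TOTAL_EXERCISES} exercises (mean of two runs)")
print(f"gap: {gap_points:.2f} percentage points, ratio: {ratio:.2f}x")
```

The non-integer count for little-coder is expected, since the figure is a mean over two runs rather than a single run's result.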