Things I got wrong building a confidence evaluator for local LLMs [D]

I've been building **Autodidact**, a local-first AI agent framework. The central piece is a **confidence evaluator** - something that decides whether a small local model (Qwen 2.5 7B, Llama 3.1 8B, Mistral 7B) can answer a question, or whether to escalate to a cloud model. If the evaluator works, you get cheap local inference most of the time and pay for cloud only when needed. Autodidact is still in development. I'll open-source the repo once v0.1 is stable enough for external eyes - until then, this post is the current state of the experiments.
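
For concreteness, the routing idea looks roughly like the sketch below. This is not the actual Autodidact code: `route`, `RoutedAnswer`, the callable signatures, and the 0.7 threshold are all placeholders made up for illustration, and it assumes the evaluator scores a local draft after it has been generated (scoring the question alone before generating is another possible design).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedAnswer:
    text: str          # the answer returned to the caller
    source: str        # "local" or "cloud"
    confidence: float  # evaluator score for the local draft, in [0, 1]


def route(
    question: str,
    local_model: Callable[[str], str],       # hypothetical: question -> answer
    cloud_model: Callable[[str], str],       # hypothetical: question -> answer
    evaluator: Callable[[str, str], float],  # hypothetical: (question, draft) -> score
    threshold: float = 0.7,                  # illustrative value, not a tuned one
) -> RoutedAnswer:
    """Answer locally when the evaluator trusts the draft, otherwise escalate."""
    draft = local_model(question)
    confidence = evaluator(question, draft)
    if confidence >= threshold:
        # Cheap path: keep the local draft.
        return RoutedAnswer(draft, "local", confidence)
    # Expensive path: discard the draft and ask the cloud model.
    return RoutedAnswer(cloud_model(question), "cloud", confidence)
```

Everything interesting lives inside `evaluator`; the router itself is just a threshold check, so a miscalibrated score translates directly into wrong escalation decisions.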