Testing Local LLMs in Practice: Code Generation, Quality vs. Speed

Hello! I spent the last few months building an AI agent that autonomously writes Go code using local LLMs. The primary use case is generating log parsers for SIEM pipelines. A large part of the work turned out to be the evaluation itself: how do you objectively measure whether a model is actually useful for autonomous coding tasks? So I built a harness that (1) lets agents generate real Go parsers, (2) compiles the generated code, (3) validates the extracted fields and their types, (4) measures parsing quality against expected schemas, and (5) tracks throughput and speed over longer runs.
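To make steps (3) and (4) more concrete, here is a minimal sketch of what a schema check could look like. It is not the actual harness code: the names `Schema`, `FieldType`, `typeOf`, and `Score`, the sample field names, and the scoring rule (fraction of expected fields that are present with the expected type) are all assumptions made for illustration.

```go
package main

import "fmt"

// FieldType is a hypothetical label for the expected type of an extracted field.
type FieldType string

const (
	TypeString FieldType = "string"
	TypeInt    FieldType = "int"
	TypeFloat  FieldType = "float"
	TypeBool   FieldType = "bool"
)

// Schema maps a field name to the type the parser is expected to produce.
type Schema map[string]FieldType

// typeOf classifies a parsed value; unknown types map to the empty label.
func typeOf(v any) FieldType {
	switch v.(type) {
	case string:
		return TypeString
	case int, int64:
		return TypeInt
	case float64:
		return TypeFloat
	case bool:
		return TypeBool
	default:
		return ""
	}
}

// Score returns the fraction of schema fields that are present in got
// with the expected type. An empty schema scores 0 (an assumed convention).
func Score(schema Schema, got map[string]any) float64 {
	if len(schema) == 0 {
		return 0
	}
	ok := 0
	for name, want := range schema {
		if v, present := got[name]; present && typeOf(v) == want {
			ok++
		}
	}
	return float64(ok) / float64(len(schema))
}

func main() {
	schema := Schema{"src_ip": TypeString, "port": TypeInt, "success": TypeBool}
	// "port" was extracted as a string, so it counts as a type mismatch.
	got := map[string]any{"src_ip": "10.0.0.1", "port": "443", "success": true}
	fmt.Printf("score: %.2f\n", Score(schema, got)) // prints "score: 0.67"
}
```

Scoring per field rather than pass/fail per log line means a parser that gets most fields right still gets partial credit, which makes it easier to compare models that fail in different ways.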