Dynamic few-shot retrieval on Apple's on-device 3B LLM: 40% → 70%+ on shell commands

I've been poking at Apple's on-device 3B model (via FoundationModels on Tahoe) to see where its ceiling sits on code-adjacent tasks. Tested shell command generation as a concrete benchmark (100 prompts, ~10 approaches) Bare model: ~40% correct. Mostly flags and some command hallucinations. Feeding documentation as context didn't help. Not man pages, not tldr as docs, not self-critique loops. All within noise of baseline, and self-critique was actively worse (33%); the model "fixes" correct commands into wrong ones.