What Happened When I Applied Karpathy's Autoresearch Idea to LLM Inference (3 minute read)
TLDR AI • Generative AI, AI Hardware, AI Research
Manthan Gupta built Auto-Inference-Optimiser, a tool that lets an AI agent hill-climb on LLM inference speed on Apple Silicon while holding output quality fixed. Argmax (greedy) sampling and simplifying the inference code gave the largest throughput gains, while most tuning knobs, as well as KV cache quantization, either hurt or had no effect. The project highlights that a disciplined, observable harness is critical for separating real performance wins from noise or benchmark illusions.
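The summary does not include the project's harness code. The following is a minimal sketch of the general idea: hill-climb on throughput, keep quality fixed, and accept a change only when it clearly beats noise. It assumes a hypothetical `benchmark(config)` that returns tokens per second for one run and a hypothetical `quality_ok(config)` check. The helper names, the repeat count, the 2% `min_gain` threshold and the "slowest candidate run must beat fastest baseline run" rule are illustrative assumptions, not Auto-Inference-Optimiser's actual method.

```python
import random
import statistics
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]  # hypothetical: tuning knob name -> value


def measure(benchmark: Callable[[Config], float], config: Config, repeats: int = 5) -> List[float]:
    """Run the (hypothetical) throughput benchmark several times so noise is visible."""
    return [benchmark(config) for _ in range(repeats)]


def is_real_win(baseline: List[float], candidate: List[float], min_gain: float = 0.02) -> bool:
    """Conservative noise filter (an assumption, not the project's rule):
    the candidate's slowest run must beat the baseline's fastest run by min_gain."""
    return min(candidate) > max(baseline) * (1 + min_gain)


def hill_climb(
    config: Config,
    propose: Callable[[Config, random.Random], Config],
    benchmark: Callable[[Config], float],
    quality_ok: Callable[[Config], bool],
    steps: int = 20,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Greedy hill-climb: keep a proposed change only if quality holds and speed clearly improves."""
    rng = random.Random(seed)
    best = dict(config)
    best_samples = measure(benchmark, best)
    for _ in range(steps):
        candidate = propose(best, rng)
        if not quality_ok(candidate):
            continue  # quality is held fixed: reject regardless of speed
        samples = measure(benchmark, candidate)
        if is_real_win(best_samples, samples):
            best, best_samples = candidate, samples
    return best, statistics.median(best_samples)


if __name__ == "__main__":
    # Toy stand-in for a real benchmark: throughput peaks at batch=8.
    def toy_benchmark(cfg: Config) -> float:
        return 100.0 - (cfg["batch"] - 8) ** 2

    def step_up(cfg: Config, rng: random.Random) -> Config:
        return {"batch": cfg["batch"] + 1}

    # prints ({'batch': 7}, 99.0): the ~1% gain from 7 to 8 is below min_gain
    print(hill_climb({"batch": 4}, step_up, toy_benchmark, lambda cfg: True))
```

Requiring the candidate's worst run to beat the baseline's best run is deliberately strict: it gives up small real gains in exchange for not chasing noise, which is the kind of benchmark illusion the post warns about. A real harness would also pin sampling and prompts so `quality_ok` compares like with like.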