What Happened When I Applied Karpathy's Autoresearch Idea to LLM Inference (3 minute read)

TLDR AI
Generative AI, AI Hardware, AI Research

Manthan Gupta built Auto-Inference-Optimiser to let an AI agent hill-climb on LLM inference speed on Apple Silicon while holding output quality fixed. Switching to argmax (greedy) sampling and simplifying the inference code gave the largest throughput gains, while most tuning knobs and KV cache quantization either hurt or had no effect. The project shows that a disciplined, observable harness is critical for telling real performance wins apart from noise and benchmark illusions.
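To make the harness idea concrete, here is a minimal sketch of a hill-climbing acceptance rule, not the project's actual code. It assumes each candidate change is benchmarked several times for throughput (tokens per second) and scored once for quality. A change is kept only if quality does not drop and the mean speedup exceeds the run-to-run noise. The function names, the dictionary layout, the `min_sigma` threshold, and all numbers are hypothetical and invented for illustration.

```python
import statistics


def is_real_win(base_tps, cand_tps, base_quality, cand_quality,
                quality_tol=0.0, min_sigma=2.0):
    """Return True only if quality holds and the speedup clears the noise.

    base_tps / cand_tps are lists of repeated throughput measurements.
    The thresholds are illustrative assumptions, not values from the article.
    """
    # Quality gate: reject anything that degrades output quality.
    if cand_quality < base_quality - quality_tol:
        return False
    gain = statistics.mean(cand_tps) - statistics.mean(base_tps)
    # Use the noisier of the two runs as the noise estimate.
    noise = max(statistics.stdev(base_tps), statistics.stdev(cand_tps))
    return gain > min_sigma * noise


def hill_climb(baseline, candidates):
    """Greedily keep each candidate that beats the current best."""
    best = baseline
    accepted = []
    for cand in candidates:
        if is_real_win(best["tps"], cand["tps"],
                       best["quality"], cand["quality"]):
            best = cand
            accepted.append(cand["name"])
    return best, accepted


# Invented measurements, for illustration only.
baseline = {"name": "baseline", "tps": [50.0, 51.0, 49.0], "quality": 0.80}
candidates = [
    {"name": "faster_change", "tps": [58.0, 59.0, 57.0], "quality": 0.80},
    {"name": "noisy_change", "tps": [58.5, 57.5, 59.0], "quality": 0.80},
    {"name": "lossy_change", "tps": [65.0, 66.0, 64.0], "quality": 0.70},
]

best, accepted = hill_climb(baseline, candidates)
print(best["name"], accepted)
```

In this toy run only the first change is kept: the second is within measurement noise of the new best, and the third is faster but fails the quality gate.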