Monarch v3: 78% Faster LLM Inference with NES-Inspired KV Paging

r/LocalLLaMA

TL;DR: We implemented NES-inspired memory paging for transformers. On a 1.1B-parameter model, inference is now 78% faster (17.01 → 30.42 tok/sec) with nearly zero VRAM overhead. The algorithm is open source, fully benchmarked, and ready to use.

The Problem

KV cache grows linearly with sequence length. By 4K tokens, most of it sits unused. Recent tokens matter far more than old ones, yet we keep everything in VRAM at full precision. Standard approaches (quantization, pruning, distillation) are invasive. We wanted something simpler: just move the old stuff out of the way.
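To put rough numbers on the linear growth, here is a minimal back-of-the-envelope sketch. It is not the Monarch code: the model shape (22 layers, 4 KV heads, head dim 64, fp16), the page size, and the helper names `kv_cache_bytes` and `split_hot_cold` are all assumptions picked for illustration. It only shows how cache size scales with sequence length and how a "keep the newest pages hot" split would divide the tokens.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for one sequence."""
    # 2 tensors (K and V) per layer, each shaped [seq_len, n_kv_heads, head_dim].
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def split_hot_cold(seq_len, page_size, hot_pages):
    """Return (cold_tokens, hot_tokens); the most recent pages stay hot."""
    n_pages = -(-seq_len // page_size)  # ceiling division
    cold_pages = max(0, n_pages - hot_pages)
    cold_tokens = min(seq_len, cold_pages * page_size)
    return cold_tokens, seq_len - cold_tokens


# Hypothetical config for a ~1.1B model (an assumption, not from the post).
CFG = dict(n_layers=22, n_kv_heads=4, head_dim=64)

if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        total = kv_cache_bytes(n, **CFG)
        cold, hot = split_hot_cold(n, page_size=256, hot_pages=4)
        print(f"{n:5d} tokens: {total / 2**20:6.1f} MiB, "
              f"{cold} cold / {hot} hot tokens")
```

Under these assumed numbers the cache doubles every time the context doubles, and at 4K tokens three quarters of the entries fall outside a 1024-token hot window. The real split depends on whatever page size and eviction policy the implementation actually uses.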