HydraLM: 22× faster decoding and 16× smaller state memory in long-context inference experiments [P]

I’ve been experimenting with HydraLM, a long-context model for inference, and the numbers are getting a bit wild. The repo’s benchmark suite reports 1.00 retrieval accuracy even when the target fact is buried at 90% depth in a 1M-token test, p = 0.987 and p = 0.999 on a 1M-key fact bank, and speculative decoding up to 1.8× faster. The reproducible results also report about 99.8% FLOP savings and full memory savings at long context. The benchmark docs, reproduction scripts, and verification logs are public, so anyone can check the results for themselves.

submitted by /u/cyh-c