Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant)

I’ve been working on an open source TurboQuant implementation for KV cache compression in llama.cpp and ran into a hard bottleneck: dequantization. At long context (32K on M5 Max), dequant alone was taking around 40% of decode time. I tried fixing it the usual way: - register LUTs - SIMD tricks - fused kernels - branchless math Tested about 14 different approaches. None beat the baseline. Hardware was already at the limit. What ended up working was much simpler. Flash attention computes softmax weights before touching V. At long context, most of those weights are basically zero.