TurboQuant on a MacBook Pro, part 2: perplexity, KL divergence, and asymmetric K/V on M5 Max
Dev.to AI
•
Generative AI
Com/blog/turboquant-m5-max-quality-and-asymmetric. Cross-posted here for the de.to audience. Yesterday's M5 Max KV cache post drew a clean set of asks in the comments: where are the perplexity numbers, what about KL divergence, did you try asymmetric K/V combos, can you fill the 32K to 128K gap with a 64K row. I ran them overnight on the same hardware. Numbers below. TL;DR q8_0 KV cache is essentially free at 4k context. PPL delta vs f16 is −0.0005 (well inside the ±0.036 stderr). KL is 0.0016. Top-1 token agreement is 98.64%. turbo3 and turbo4 cost real but small quality.