Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max

r/LocalLLaMA
Open Source AI AI Research

Followup to yesterday's post:. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point. Ran them overnight. Same M5 Max, same Qwen 3.6-35B-A3B Q8, same TheTom TurboQuant fork (feature/turboquant-k-cache). Quality (perplexity + KL divergence on wikitext-2) For u/milpster and u/Karyo_Ten. Context size 4096, since the canonical 512 doesn't fill enough KV cache to surface cache-quantization effects. f16 saves the baseline logits via --kl-divergence-base, then each quant run computes KL against that.