TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual – 3.2× memory savings

r/LocalLLaMA

An adaptation of the recent TurboQuant algorithm (Zandieh, 2025) from KV‑cache quantization to model weight compression. It gives you a drop‑in replacement for nn.Linear with near‑optimal distortion.

Benchmarks (Qwen3.5‑0.8B, WikiText‑103)

| Config | Bits | PPL | Δ PPL | Compressed Size |
|---|---|---|---|---|
| Baseline bf16 | 16 | 14.29 | - | 1,504 MB |
| 4+4 residual | 8 | 14.29 | 0.00 | 762 MB |
| 4‑bit (group=full) | 4 | 16.23 | +1.94 | 361 MB |
| 4‑bit (group=128) | 4 | 16.57 | +2.28 | 381 MB |

Check the GitHub repo for full docs, benchmarks, and Triton kernel details.

submitted by /u/cksac
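For anyone wondering what the "4+4 residual" row means mechanically, here is a rough pure-Python sketch of the general idea: rotate a weight row, quantize it to 4 bits, then quantize the leftover error with a second 4-bit pass. This is not the repo's code or API. The function names (`fwht`, `quantize4`, `encode_row`, `decode_row`) are made up for illustration, the rotation is a randomized Hadamard transform standing in for a random rotation, and the quantizer is a plain uniform absmax grid rather than the codebook TurboQuant actually uses.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in x]


def quantize4(x):
    """Uniform symmetric 4-bit quantizer (assumption: NOT the TurboQuant codebook).

    Returns integer codes in [-8, 7] and a single float scale.
    """
    amax = max(abs(v) for v in x)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in x]
    return codes, scale


def dequantize4(codes, scale):
    return [c * scale for c in codes]


def encode_row(w, signs):
    """Rotate one weight row, then quantize it in two 4-bit stages."""
    # Randomized Hadamard rotation: random sign flips followed by the transform.
    rotated = fwht([s * v for s, v in zip(signs, w)])
    codes1, scale1 = quantize4(rotated)
    # Second stage quantizes whatever error the first stage left behind.
    residual = [r - q for r, q in zip(rotated, dequantize4(codes1, scale1))]
    codes2, scale2 = quantize4(residual)
    return (codes1, scale1), (codes2, scale2)


def decode_row(stage1, stage2, signs, use_residual=True):
    """Reconstruct the row from stage 1 alone (4 bits) or both stages (4+4 bits)."""
    approx = dequantize4(*stage1)
    if use_residual:
        approx = [a + b for a, b in zip(approx, dequantize4(*stage2))]
    # The orthonormal Hadamard matrix is its own inverse; then undo the sign flips.
    return [s * v for s, v in zip(signs, fwht(approx))]


if __name__ == "__main__":
    random.seed(0)
    n = 128  # hypothetical row length, power of two
    w = [random.gauss(0.0, 1.0) for _ in range(n)]
    signs = [random.choice((-1.0, 1.0)) for _ in range(n)]
    stage1, stage2 = encode_row(w, signs)

    def mse(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

    print("4-bit MSE:  ", mse(w, decode_row(stage1, stage2, signs, use_residual=False)))
    print("4+4-bit MSE:", mse(w, decode_row(stage1, stage2, signs)))
```

The second stage only has to cover the first stage's rounding error, which spans a much narrower range than the weights themselves, so the combined reconstruction error drops sharply. That fits the table, where the 4+4 config matches baseline perplexity at roughly half the bf16 size.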