Unweight: how we compressed an LLM 22% without sacrificing quality

r/LocalLLaMA

Summary: In LLM inference on modern GPUs (like the NVIDIA H100), the bottleneck is memory bandwidth, not compute. Token generation speed is limited by the time it takes to move model weights from the GPU's main memory (HBM), which is slow relative to on-chip memory, to its processing cores.

**The Solution: Unweight** Cloudflare developed Unweight, a **lossless** compression system that shrinks model weights by 15-22% (saving ~3 GB of VRAM on an 8B-parameter model) while preserving bit-exact outputs, all without needing specialized hardware.
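For a rough sense of where the ~3 GB figure comes from, here is a back-of-the-envelope sketch. It is not from the post. It assumes 2 bytes per weight (BF16), a peak HBM bandwidth of 3.35 TB/s (an assumed H100-class number), and a simplified decode model in which every weight is read once per generated token at batch size 1. The function names are made up for illustration, and KV cache traffic and decompression cost are ignored.

```python
# Back-of-the-envelope numbers only; all constants below are assumptions.

def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Size of the weights in bytes (assumes BF16: 2 bytes per parameter)."""
    return n_params * bytes_per_param


def saved_bytes(total_bytes: int, percent: int) -> int:
    """Bytes saved if the weights shrink by `percent` percent."""
    return total_bytes * percent // 100


def decode_tokens_per_second_bound(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s if every weight byte is read once per token."""
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    total = weight_bytes(8_000_000_000)   # 8B-parameter model
    hbm_bandwidth = 3.35e12               # assumed peak HBM bandwidth, bytes/s

    for percent in (15, 22):
        saved = saved_bytes(total, percent)
        bound = decode_tokens_per_second_bound(total - saved, hbm_bandwidth)
        print(f"{percent}% smaller: saves {saved / 1e9:.2f} GB, "
              f"bandwidth-bound ceiling ~{bound:.0f} tokens/s")
```

Under these assumptions an 8B model holds 16 GB of weights, so a 15-22% reduction saves roughly 2.4 to 3.5 GB, which lines up with the ~3 GB in the summary. The tokens/s ceiling is only an idealized bound; whether fewer bytes moved translates into real speedup depends on how cheaply the weights can be decompressed on the GPU.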