Codebook Lossless LLM Compression: 10–25%+ RAM reduction with bitwise generic packing of indexed weights

r/LocalLLaMA

So I asked myself a question (and then asked a coding model to build some pieces for me): when we talk about the values in a layer of an LLM, how many are actually unique? The answer led me down a couple of weeks of coding (yes, with Claude, Qwen, and Gemini). fp16 is 16 bits, but most of the models I ran into only use about 12-13 bits' worth of unique values per layer. By storing those unique values in a codebook and packing the indices into blocks, we can squeeze most of the models I tried down by 10-25%, losslessly. By trading a bit of inference speed for size, we can fit models onto smaller cards.
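To make the idea concrete, here is a minimal sketch of codebook indexing plus bit packing for a single fp16 tensor, using NumPy. This is not the actual implementation from the project: the function names `encode` and `decode` are made up for illustration, and it packs one whole tensor at once rather than in blocks. The arithmetic behind the headline number is simple: 12 bits per weight instead of 16 is a 25% saving, 13 bits is about 19%, and the codebook itself costs a little on top.

```python
import numpy as np

def encode(weights):
    """Hypothetical helper: return (codebook, packed, bits, n) for an fp16 tensor."""
    # Compare raw 16-bit patterns, not float values, so -0.0 and NaN payloads survive.
    raw = weights.astype(np.float16).ravel().view(np.uint16)
    codebook, idx = np.unique(raw, return_inverse=True)
    idx = idx.ravel()
    # Smallest bit width that can address every codebook entry.
    bits = max(1, (len(codebook) - 1).bit_length())
    # Expand each index into `bits` bits (MSB first), then pack 8 bits per byte.
    shifts = np.arange(bits - 1, -1, -1, dtype=np.uint32)
    bit_matrix = ((idx.astype(np.uint32)[:, None] >> shifts) & 1).astype(np.uint8)
    packed = np.packbits(bit_matrix.ravel())
    return codebook, packed, bits, raw.size

def decode(codebook, packed, bits, n):
    """Hypothetical helper: rebuild the original fp16 values bit for bit."""
    bit_matrix = np.unpackbits(packed)[: n * bits].reshape(n, bits).astype(np.uint32)
    shifts = np.arange(bits - 1, -1, -1, dtype=np.uint32)
    idx = (bit_matrix << shifts).sum(axis=1).astype(np.intp)
    return codebook[idx].view(np.float16)

if __name__ == "__main__":
    # Synthetic stand-in for a layer; real model weights will behave differently.
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, 100_000).astype(np.float16)
    codebook, packed, bits, n = encode(w)
    restored = decode(codebook, packed, bits, n)
    assert np.array_equal(restored.view(np.uint16), w.view(np.uint16))
    original_bytes = w.nbytes
    compressed_bytes = packed.nbytes + codebook.nbytes
    print(f"{len(codebook)} unique values, {bits} bits per index")
    print(f"{original_bytes} bytes -> {compressed_bytes} bytes")
```

The speed trade-off shows up in `decode`: every weight has to be unpacked and looked up in the codebook before it can be used, which is extra work compared with reading fp16 directly.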