We compressed 6 LLMs and found something surprising: they don't degrade the same way

r/LocalLLaMA

TL;DR: We shrank the MLP layers inside transformers (no quantization, no custom kernels) and measured how accuracy drops across ARC, HellaSwag, MMLU, and TruthfulQA. We expected similar behavior across models. We were wrong. Even more surprising, the perplexity (PPL) improvements we originally saw did not translate into downstream benchmark results.

The key result

Some models are far more compressible than others.

Gemma 2B → holds ~92% accuracy at 14% compression
Llama 3.1 8B → drops to ~85% at the same compression

Same method. Same % removed. Totally different outcomes.
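
For anyone wondering what "shrinking the MLP layers" can look like mechanically, here is a minimal pure-Python sketch of structured width pruning on a single MLP block. This is an illustration, not necessarily the method used in the post: the neuron-scoring rule (product of incoming and outgoing weight norms), the two-matrix ReLU MLP, the function names `prune_mlp` and `mlp_forward`, and the toy weights are all assumptions. Real Gemma and Llama MLPs are gated and use three projections, so a real implementation would slice all three consistently.

```python
import math


def prune_mlp(w_up, w_down, fraction):
    """Drop `fraction` of the hidden neurons from a two-matrix MLP.

    w_up:   hidden x d_model (one row per hidden neuron)
    w_down: d_model x hidden (one column per hidden neuron)
    Returns the smaller matrices and the indices of the neurons kept.
    """
    hidden = len(w_up)
    keep_n = hidden - int(round(hidden * fraction))

    def score(j):
        # Assumed importance score: norm of incoming weights times
        # norm of outgoing weights for hidden neuron j.
        up = math.sqrt(sum(x * x for x in w_up[j]))
        down = math.sqrt(sum(row[j] * row[j] for row in w_down))
        return up * down

    ranked = sorted(range(hidden), key=score, reverse=True)
    keep = sorted(ranked[:keep_n])  # preserve original neuron order
    new_up = [w_up[j] for j in keep]
    new_down = [[row[j] for j in keep] for row in w_down]
    return new_up, new_down, keep


def mlp_forward(w_up, w_down, x):
    """Plain ReLU MLP: y = W_down @ relu(W_up @ x)."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_up]
    return [sum(w * hj for w, hj in zip(row, h)) for row in w_down]


# Toy weights (made up for illustration): d_model = 2, hidden = 4.
w_up = [[1.0, 0.0], [0.0, 1.0], [0.01, 0.0], [0.5, 0.5]]
w_down = [[1.0, 0.0, 0.01, 0.5], [0.0, 1.0, 0.0, 0.5]]

# Remove 25% of the hidden width (one neuron out of four).
new_up, new_down, kept = prune_mlp(w_up, w_down, 0.25)

x = [1.0, 2.0]
y_full = mlp_forward(w_up, w_down, x)
y_pruned = mlp_forward(new_up, new_down, x)
print("kept neurons:", kept)
print("full output:  ", y_full)
print("pruned output:", y_pruned)
```

The result is still a plain dense MLP, just narrower, which is why this style of compression needs no quantization and no custom kernels. How much the output shifts depends entirely on the weights being removed, so the same % removed can hurt one model far more than another.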