Google’s TurboQuant Compression May Support Faster Inference, Same Accuracy on Less Capable Hardware

InfoQ AI/ML

Google Research unveiled TurboQuant, a novel quantization algorithm that compresses large language models’ Key-Value caches by up to 6x. With 3.5-bit compression, near-zero accuracy loss, and no re