Looking for feedback: Porting Google's TurboQuant (QJL) KV Cache compression to MLX
Hey r/LocalLLaMA, I've been implementing the concepts from Google Research's recent TurboQuant (QJL) paper natively in MLX for Apple Silicon. The paper claims massive KV cache compression (down to 1-bit/3-bit) with near-zero accuracy loss. I've built a working implementation (TurboKVCacheMLX), integrated it directly into my local mlx_lm library, and just finished a real-world benchmark on a __TECH_PRESERVE_1TECH_PRESERVE_3__.
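For anyone who hasn't read the paper, here is a minimal sketch of the 1-bit idea as I understand it: project each key with a random Gaussian matrix, keep only the sign bits plus the key's norm, and estimate attention scores from the full-precision query. This is plain Python for illustration, not the TurboKVCacheMLX code and not an MLX API. The function names (`make_projection`, `quantize_key`, `estimate_dot`) and the sizes are hypothetical.

```python
import math
import random


def make_projection(m, d, rng):
    """Random Gaussian projection matrix with m rows and d columns."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def matvec(S, x):
    """Plain matrix-vector product S @ x."""
    return [sum(s * v for s, v in zip(row, x)) for row in S]


def quantize_key(S, k):
    """Keep one sign per projected coordinate, plus the key's L2 norm."""
    bits = [1 if v >= 0.0 else -1 for v in matvec(S, k)]
    norm = math.sqrt(sum(v * v for v in k))
    return bits, norm


def estimate_dot(S, q, bits, norm):
    """Estimate <q, k> from the unquantized query and the sign sketch of k.

    Uses the scaling sqrt(pi/2) * ||k|| / m, which makes the estimate
    unbiased for Gaussian projections.
    """
    m = len(S)
    proj_q = matvec(S, q)
    corr = sum(p * b for p, b in zip(proj_q, bits))
    return math.sqrt(math.pi / 2.0) * norm / m * corr


if __name__ == "__main__":
    rng = random.Random(0)
    d, m = 32, 4096  # assumed sizes; m is oversized here to keep the noise low
    k = [rng.gauss(0.0, 1.0) for _ in range(d)]
    q = [v + 0.5 * rng.gauss(0.0, 1.0) for v in k]
    S = make_projection(m, d, rng)
    bits, norm = quantize_key(S, k)
    exact = sum(a * b for a, b in zip(q, k))
    print("exact:", exact, "estimate:", estimate_dot(S, q, bits, norm))
```

Note that `m = 4096` sign bits for a 32-dimensional key is not compression at all; it is only there so the estimate is visibly close to the exact value. A real cache would use far fewer projections per key and accept more noise per score.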