Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR?
r/LocalLLaMA
Hey everyone,

Ever since Google announced TurboQuant, I've been following the news about its extreme compression without noticeable quality degradation. I see it mentioned constantly on this sub, but despite all the discussion I'm honestly still a bit confused: is it actually usable for us right now? And if so, how?

I recently saw an article/post where someone applied TQ quantization directly to the model weights.