Llama.cpp with Turboquant, Heavy-Hitter Oracle (H2O), and StreamingLLM. Even more performance!
r/LocalLLaMA
Following TheTom's great work yesterday showing Turboquant working in Llama.cpp, I added a few other things that bring some complementary speedups to Llama.cpp. So far, CPU and CUDA both build and are fully usable. I'm seeing full-speed token generation on my 16 GB 4060 Ti at context windows up to 256k+ using Qwen 3.5 4B, which is pretty insane. Check out DEEPDIVE.md for all the technical details and README_TURBOQUANT.md to get up and running. If you have any questions or suggestions, please hit me up or post a GitHub issue.

Submitted by /u/peva3