APEX MoE quantized models: 33% faster inference, plus TurboQuant (14% speedup in prompt processing)
r/LocalLLaMA
I've just released APEX (Adaptive Precision for EXpert Models), a novel quantization technique for MoE architectures that outperforms Unsloth Dynamic 2.0 on accuracy while being 2x smaller. I benchmarked it on Qwen3.5-35B-A3B, but the method applies to any MoE model. It is half the size of Q8, its perplexity is comparable to F16, and it works with stock llama.cpp; no patches required.
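Here is a rough back-of-the-envelope sketch of what "half the size of Q8" means in bits and gigabytes. It assumes llama.cpp's Q8_0 format (blocks of 32 int8 weights plus one fp16 scale, i.e. 8.5 bits per weight), takes 35B total parameters from the model name, and ignores GGUF metadata. The 95% at 4 bits / 5% at 8.5 bits split is purely hypothetical and is NOT APEX's actual allocation. It only shows how spending fewer bits on the expert weights, which make up most of an MoE's parameters, pulls the average down.

```python
# Q8_0 in llama.cpp/GGUF: blocks of 32 int8 weights plus one fp16 scale.
Q8_0_BITS_PER_WEIGHT = (32 * 8 + 16) / 32  # = 8.5 bits per weight


def size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def blended_bpw(groups: list[tuple[float, float]]) -> float:
    """Parameter-weighted average bits per weight for (fraction, bits) pairs."""
    total = sum(frac for frac, _ in groups)
    return sum(frac * bits for frac, bits in groups) / total


N_PARAMS = 35e9  # assumption: total parameter count read off "35B" in the name

q8_size = size_gb(N_PARAMS, Q8_0_BITS_PER_WEIGHT)
target_bpw = Q8_0_BITS_PER_WEIGHT / 2  # "half the size of Q8"

# HYPOTHETICAL split for illustration only (not APEX's real recipe):
# expert FFN weights at 4 bits, everything else kept at Q8_0 precision.
example = blended_bpw([(0.95, 4.0), (0.05, Q8_0_BITS_PER_WEIGHT)])

print(f"Q8_0: {Q8_0_BITS_PER_WEIGHT} bpw, ~{q8_size:.1f} GB")
print(f"Half of Q8_0: {target_bpw} bpw, ~{size_gb(N_PARAMS, target_bpw):.1f} GB")
print(f"Hypothetical mix: {example:.3f} bpw, ~{size_gb(N_PARAMS, example):.1f} GB")
```

Under these assumptions Q8_0 lands around 37 GB and the half-size target around 18.6 GB. Real quantized files differ somewhat because of metadata, embedding/output tensor choices, and the exact parameter count.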