Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40%
r/LocalLLaMA
Implemented Multi-Token Prediction (MTP) for LLaMA.cpp and quantized the Gemma 4 assistant models into GGUF format. Tests were run on a MacBook Pro M5 Max. Gemma 26B with MTP produces tokens about 40% faster.

Prompt: Write a Python program to find the nth Fibonacci number using recursion

Outputs:
LLaMA.cpp: 97 tokens/s
LLaMA.cpp + MTP: 138 tokens/s

Gemma4-assistant GGUF Quantized models:
Local AI models app:
Patched llama.cpp:

submitted by /u/gladkos
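For anyone who wants to check the headline figure, here is a minimal sketch of the arithmetic behind the two throughput numbers quoted above. The function names are made up for illustration and are not part of llama.cpp. It assumes both runs produced comparable output, so tokens/s can be compared directly.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def latency_reduction_percent(baseline_tps: float, new_tps: float) -> float:
    """Reduction in average time per token, in percent."""
    return (1.0 - baseline_tps / new_tps) * 100.0


# Throughput figures from the post (tokens per second).
baseline_tps = 97.0   # LLaMA.cpp
mtp_tps = 138.0       # LLaMA.cpp + MTP

print(f"Throughput gain: {speedup_percent(baseline_tps, mtp_tps):.1f}%")
print(f"Time per token:  {1000.0 / baseline_tps:.2f} ms -> {1000.0 / mtp_tps:.2f} ms")
print(f"Latency cut:     {latency_reduction_percent(baseline_tps, mtp_tps):.1f}%")
```

The gain works out to roughly 42%, which is consistent with the rounded "40%" in the title. Note that a ~42% throughput gain corresponds to about a 30% reduction in time per token, so the two ways of stating the result should not be mixed up.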