Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B?

r/LocalLLaMA

Has anyone here tested speculative decoding in llama.cpp with Gemma 4 31B IT or Qwen 3.5 27B? For Gemma, I was thinking about using a smaller draft model from the same family. For Qwen 3.5, I’m not sure whether it works well in llama.cpp at all. If you tried it, which draft model worked best, and did you get a real speedup?

submitted by /u/No_Algae1753