Speculative decoding question, 665% speed increase

r/LocalLLaMA

I'm using these settings in llama.cpp: --spec-type ngram-map-k --spec-ngram-size-n 24 --draft-min 12 --draft-max 48

What's the real reason the results differ so much between models? Say the prompt asks for "minor changes in code". This is what I see:

Gemma 4 31b: generation t/s doubles, so a 100% increase
Qwen 3.6: only a 40% speed increase
Devstral Small: 665% increase in speed (what?)

EDIT: I added --repeat-penalty 1.0 and switched to --spec-type ngram-mod for Qwen 3.6. Speed is now increased by 140 t/s over the 100 t/s base on minor edits.

submitted by /u/GodComplecs
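To make the question concrete, here is a toy sketch of the general idea behind n-gram based drafting: look up the last few tokens earlier in the context and propose whatever followed them, then see how many proposed tokens the model would actually have produced. This is not llama.cpp's implementation. The function names `draft_from_history` and `count_accepted` are made up for illustration, and `n` and `max_draft` are only loosely analogous to the n-gram size and draft length flags above. If it works roughly like this, a model that reproduces the existing code verbatim on a "minor changes" prompt would accept long drafts, while one that rephrases or reformats would reject them early, which could be one reason the speedups differ.

```python
def draft_from_history(history, n, max_draft):
    """Propose up to max_draft tokens by matching the last n tokens
    against an earlier position in the history (most recent match wins)."""
    if len(history) < n + 1:
        return []
    key = history[-n:]
    # Scan backwards, skipping the trailing occurrence of the key itself.
    for start in range(len(history) - n - 1, -1, -1):
        if history[start:start + n] == key:
            return history[start + n:start + n + max_draft]
    return []


def count_accepted(draft, target_tokens):
    """Count how many leading draft tokens match what the model would emit."""
    accepted = 0
    for drafted, actual in zip(draft, target_tokens):
        if drafted != actual:
            break
        accepted += 1
    return accepted


# Toy context: the model is about to repeat a call it has already seen.
history = ["x", "=", "foo", "(", "a", ")", ";", "y", "=", "foo", "("]
draft = draft_from_history(history, n=2, max_draft=3)

# A model that copies the earlier code accepts the whole draft;
# one that changes the first token accepts nothing.
copies = count_accepted(draft, ["a", ")", ";"])
rewrites = count_accepted(draft, ["b", ")", ";"])
print(draft, copies, rewrites)
```

In this toy case the drafter only pays off when the continuation repeats earlier context, so the acceptance count is a property of how each model writes its output, not only of the flags.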