[Tool] Quick hack to recover Qwen3.5 MTP after fine-tuning for faster inference speed (Transformers)

r/LocalLLaMA

Disclaimer: I work at NuMind (we train LLMs for structured + content extraction).

If you've been working with Qwen3.5 (and other recently released models), you probably know it includes Multi-Token Prediction (MTP) modules. When used with vLLM (qwen3_next_mtp), this can significantly speed up inference, especially on predictable workloads (the more predictable the output, the better, since the draft tokens will have a higher acceptance rate).

However:

- Hugging Face Transformers doesn't support MTP yet, neither for inference nor