r/LocalLLaMA
## Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB

I managed to get **DFlash speculative decoding** working in llama.cpp on a pretty VRAM-limited setup.

This was tested with the DFlash PR.

Build tested:

```text
67cb0d507
```

Setup:

- GPU: RTX 2080 SUPER 8GB
- Model: Qwen3.5-35B-A3B Q5_K_M
- Draft model: Qwen3.5-35B-A3B-DFlash Q4_K_M
- Backend: CUDA

The main model is a 35B MoE GGUF of around 24.44 GiB, so obviously it does not fit in 8GB VRAM. The trick was combining MoE expert CPU offload with DFlash.
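To put a number on "does not fit", here is a rough back-of-the-envelope sketch in Python. It only uses the two sizes from the setup above and treats the card's 8GB as 8 GiB. The function name and the `vram_reserved_gib` parameter are made up for illustration, and the reserve value in the second call is a guess, not something I measured. It ignores the KV cache, the draft model, and CUDA overhead unless you fold them into that reserve.

```python
def min_cpu_resident_gib(model_gib: float, vram_gib: float,
                         vram_reserved_gib: float = 0.0) -> float:
    """Lower bound on the weights (in GiB) that have to stay in system RAM.

    vram_reserved_gib is a hypothetical allowance for everything else that
    needs VRAM (KV cache, draft model, CUDA buffers).
    """
    usable_vram = max(vram_gib - vram_reserved_gib, 0.0)
    return max(model_gib - usable_vram, 0.0)


main_model_gib = 24.44  # Qwen3.5-35B-A3B Q5_K_M, size from the post
vram_gib = 8.0          # RTX 2080 SUPER, treating 8GB as 8 GiB

# Best case: every byte of VRAM goes to main-model weights.
best_case = min_cpu_resident_gib(main_model_gib, vram_gib)

# Assumed 2 GiB held back for KV cache, draft model, etc. (illustrative only).
with_reserve = min_cpu_resident_gib(main_model_gib, vram_gib, vram_reserved_gib=2.0)

print(f"best case:    {best_case:.2f} GiB must live on the CPU side")
print(f"with reserve: {with_reserve:.2f} GiB must live on the CPU side")
```

So even in the best case roughly two thirds of the weights have to sit in system RAM, which is why offloading the MoE experts to the CPU is needed before DFlash can help at all.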