Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB

r/LocalLLaMA
Generative AI AI Hardware Open Source AI

## Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB I managed to get **DFlash speculative decoding** working in llama.cpp on a pretty VRAM-limited setup. This was tested with the DFlash PR: Build tested: ```text 67cb0d507 Setup: GPU: RTX 2080 SUPER 8GB Model: Qwen3.5-35B-A3B Q5_K_M Draft model: Qwen3.5-35B-A3B-DFlash Q4_K_M Backend: CUDA The main model is a 35B MoE GGUF around 24.44 GiB, so obviously it does not fit in 8GB VRAM. The trick was combining MoE expert CPU offload with DFlash.