llama.cpp DeepSeek v4 Flash experimental inference

r/LocalLLaMA

Hi, here you can find an experimental llama.cpp for DeepSeek v4, and here is the GGUF you can use to run inference with "just" (lol) 128GB of RAM. The model, even quantized to 2 bits, looks very solid in my limited testing, and the speed of 17 t/s on my MacBook M3 Max is quite interesting; I would say we are in the usable zone. What I did was heavily quantize the routed experts to 2 bits, using two different 2-bit quants to balance error and size.
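
To make the "two different 2-bit quants" idea concrete, here is a rough Python sketch of a per-tensor quant choice plus a size estimate. This is not the code from the fork. The quant labels, the bits-per-weight numbers, the `_exps` / `ffn_down` name patterns, and the rule of giving the down projection the larger quant are all assumptions made for illustration.

```python
# Hypothetical sketch: choose a quant type per tensor and estimate the size.
# All labels, bits-per-weight values, and name patterns are illustrative
# assumptions, not the settings actually used for the GGUF.

# Assumed average bits per weight for each made-up quant label.
BITS_PER_WEIGHT = {
    "2bit_small": 2.1,  # smaller 2-bit quant, more error
    "2bit_large": 2.6,  # larger 2-bit quant, less error
    "8bit": 8.5,        # higher precision for everything that is not a routed expert
}


def pick_quant(tensor_name: str) -> str:
    """Return the (hypothetical) quant label for a tensor, based on its name."""
    # Assumption: routed-expert tensors carry "_exps" in their name.
    if "_exps" not in tensor_name:
        return "8bit"
    # Assumption: the down projection gets the larger of the two 2-bit quants.
    if "ffn_down" in tensor_name:
        return "2bit_large"
    return "2bit_small"


def estimate_size_gib(tensors: dict[str, int]) -> float:
    """Estimate total size in GiB from a {tensor name: weight count} mapping."""
    total_bits = sum(
        count * BITS_PER_WEIGHT[pick_quant(name)] for name, count in tensors.items()
    )
    return total_bits / 8 / 2**30
```

Calling `estimate_size_gib` with the real tensor names and weight counts (for example read from the GGUF metadata) would show how much moving one tensor group between the two 2-bit quants changes the total, which is the error versus size trade-off mentioned above.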