PFlash: 10x prefill speedup over llama.cpp at 128K on an RTX 3090

r/LocalLLaMA

Hey fellow Llamas, thank you for all the kind words and great feedback on my last post. We have something new that we thought would be useful to share. As always, your time is precious, so I'll keep it short. We built speculative prefill for long-context decode on quantized 27B targets, in C++/CUDA only. A small drafter loaded in-process scores token importance over the full prompt; the heavy target then prefills only the spans that matter. Repo: github.com/Luce-Org/lucebox-hub (open source).
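For anyone who wants the span-selection idea in concrete form, here is a rough Python sketch. It is not the PFlash code (that is C++/CUDA, see the repo), and the post does not spell out the actual selection rule. The function name `select_spans`, the `keep_ratio` and `pad` parameters, and the top-k-plus-padding rule are all assumptions made for illustration. It takes per-token importance scores, as a drafter might produce, and returns the merged spans a target model would prefill.

```python
from typing import List, Tuple


def select_spans(
    scores: List[float], keep_ratio: float = 0.25, pad: int = 1
) -> List[Tuple[int, int]]:
    """Return merged half-open [start, end) spans around the top-scoring tokens.

    Hypothetical helper: keep_ratio and pad are illustrative knobs,
    not parameters taken from PFlash.
    """
    n = len(scores)
    if n == 0:
        return []
    k = max(1, int(n * keep_ratio))
    # Indices of the k highest-scoring tokens, restored to prompt order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:k])
    spans: List[Tuple[int, int]] = []
    for i in top:
        # Pad each kept token with a little surrounding context.
        start, end = max(0, i - pad), min(n, i + pad + 1)
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous span: extend it.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans


# Toy importance scores standing in for drafter output.
scores = [0.1, 0.9, 0.2, 0.05, 0.05, 0.05, 0.8, 0.1]
spans = select_spans(scores, keep_ratio=0.25, pad=1)
kept = sum(end - start for start, end in spans)
print(spans, f"{kept}/{len(scores)} tokens kept")
```

With a toy prompt this short the padding dominates, so most tokens survive. The savings only show up when the prompt is long and `keep_ratio` is small, which is the long-context case the post is about.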