PFlash: 10x prefill speedup over llama.cpp at 128K on an RTX 3090
r/LocalLLaMA
Hey fellow Llamas, thank you for all the kind words and great feedback on my last post. We have something new that we thought would be useful to share. As always, your time is precious, so I'll keep it short. We built speculative prefill for long-context decode on quantized 27B targets, C++/CUDA only. A small drafter loaded in-process scores token importance over the full prompt; the heavy target only prefills the spans that matter. Repo: github.com/Luce-Org/lucebox-hub (open source).
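For anyone who wants a feel for the span-selection idea, here is a rough Python sketch. To be clear, this is not the repo's code (that is C++/CUDA), and everything in it is an assumption for illustration: the function name `select_spans`, the `span_len` and `keep_ratio` parameters, the fixed-size chunking, and ranking chunks by mean score are all hypothetical. It only shows the general shape of "drafter scores tokens, target prefills the top spans".

```python
import math
from typing import List, Tuple


def select_spans(
    scores: List[float], span_len: int, keep_ratio: float
) -> List[Tuple[int, int]]:
    """Pick the highest-scoring fixed-size spans of a prompt.

    scores     -- one importance score per prompt token (from a drafter)
    span_len   -- hypothetical chunk size in tokens
    keep_ratio -- hypothetical fraction of chunks the target should prefill
    Returns half-open (start, end) token ranges in prompt order.
    """
    if span_len <= 0 or not 0.0 < keep_ratio <= 1.0:
        raise ValueError("span_len must be > 0 and keep_ratio in (0, 1]")

    n = len(scores)
    # Cut the prompt into fixed-size chunks; the last one may be shorter.
    chunks = [(s, min(s + span_len, n)) for s in range(0, n, span_len)]
    if not chunks:
        return []

    # Rank chunks by mean token importance (an assumed scoring rule).
    means = [sum(scores[a:b]) / (b - a) for a, b in chunks]
    keep = max(1, math.ceil(len(chunks) * keep_ratio))
    top = sorted(range(len(chunks)), key=lambda i: means[i], reverse=True)[:keep]

    # Restore prompt order and merge chunks that touch each other.
    merged: List[Tuple[int, int]] = []
    for i in sorted(top):
        a, b = chunks[i]
        if merged and merged[-1][1] == a:
            merged[-1] = (merged[-1][0], b)
        else:
            merged.append((a, b))
    return merged


if __name__ == "__main__":
    toy_scores = [0.1, 0.1, 0.9, 0.8, 0.7, 0.9, 0.0, 0.1]
    print(select_spans(toy_scores, span_len=2, keep_ratio=0.5))
```

On the toy scores above, the two middle chunks win and get merged into a single range, so the target would only prefill tokens 2 through 5. The real trade-off is in how aggressive the keep ratio is: lower means faster prefill but more risk of dropping context the target needed.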