Krasis LLM Runtime: 8.9x prefill / 10.2x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM (corrected llama numbers)
r/LocalLLaMA
Update: I've removed the llama.cpp comparisons from the README and from the body of this post. Decode speeds for llama.cpp depend heavily on the CPU, especially DRAM speed, and apparently also on non-default flags. In my testing, Krasis is substantially faster for larger models that don't fit entirely in VRAM, but the figures in the charts above will vary with a number of factors.

Since Krasis' initial release I've been working on optimising decode speeds. This has led to dropping the dual-format system and running both prefill and decode entirely on the GPU, with very different optimisation strategies.