Krasis LLM Runtime: 8.9x prefill / 4.7x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM

r/LocalLLaMA

Since Krasis' initial release I've been working on optimising decode speeds. This has led to dropping the dual-format system and running both prefill and decode entirely on the GPU, each with a very different optimisation strategy.