Krasis LLM Runtime: 8.9x prefill / 4.7x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM

Since Krasis' initial release I've been working on optimising decode speed. That work led me to drop the dual-format system and run both prefill and decode entirely on the GPU, with very different optimisation strategies.