AI RESEARCH
FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast
arXiv CS.AI
arXiv:2605.08314v1 Announce Type: cross
SVD-based low-rank compression reduces transformer parameters and nominal FLOPs, but these savings often translate poorly into real LLM serving speedups. We show that this gap is largely a runtime problem: factorized checkpoints fragment execution paths, and the resulting overhead differs substantially between prefill and autoregressive decode. We present FlashSVD v1.5, a unified inference runtime for serving SVD-compressed transformers.
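To make the "nominal" savings concrete, here is a minimal accounting sketch in Python. It is not FlashSVD code. It assumes a single linear layer whose weight W (d_out x d_in) is replaced by a rank-r product A @ B, and every name in it (`LinearCost`, `dense_cost`, `low_rank_cost`, `break_even_rank`) is hypothetical. The layer sizes in the example are chosen for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCost:
    params: int           # stored weight entries
    flops_per_token: int  # one multiply-add counted as 2 FLOPs
    matmuls: int          # separate matrix multiplications per application


def dense_cost(d_out: int, d_in: int) -> LinearCost:
    """Cost of y = W @ x with a dense W of shape (d_out, d_in)."""
    n = d_out * d_in
    return LinearCost(params=n, flops_per_token=2 * n, matmuls=1)


def low_rank_cost(d_out: int, d_in: int, rank: int) -> LinearCost:
    """Cost of y = A @ (B @ x), with A: (d_out, rank) and B: (rank, d_in)."""
    n = rank * (d_out + d_in)
    return LinearCost(params=n, flops_per_token=2 * n, matmuls=2)


def break_even_rank(d_out: int, d_in: int) -> int:
    """Largest rank for which the factored form stores strictly fewer entries.

    Solves rank * (d_out + d_in) < d_out * d_in for integer rank.
    """
    return (d_out * d_in - 1) // (d_out + d_in)


# Illustrative sizes only (assumed, not taken from the paper).
dense = dense_cost(4096, 4096)
factored = low_rank_cost(4096, 4096, rank=512)

print(dense)
print(factored)
print("parameter ratio:", factored.params / dense.params)
print("break-even rank:", break_even_rank(4096, 4096))
```

In this sketch the factored layer stores a quarter of the entries and performs a quarter of the arithmetic, but it runs two matrix multiplications where the dense layer runs one. That doubling of operations per layer is one plausible reading of the fragmented execution paths the abstract mentions, and it is a cost that parameter and FLOP counts do not capture.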