AI RESEARCH
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing
arXiv cs.LG
arXiv:2507.21433v3 (announce type: replace)

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, which rely on them for advanced reasoning. However, deploying these models in production poses a significant quality-of-service (QoS) challenge: the substantial memory overhead of their long autoregressive inference severely limits throughput and increases latency, degrading the service that concurrent users receive.