AI RESEARCH

MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

arXiv cs.LG

arXiv:2604.22881v1 Announce Type: new

Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU memory limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup. To bridge the I/O gap between tiers, MTServe.
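
To make the idea of a hierarchical cache concrete, here is a minimal sketch of a generic two-tier LRU cache in Python. It is not MTServe's algorithm, which the abstract does not specify. The class name TwoTierCache, its methods, and the capacities are hypothetical, and plain dictionaries stand in for GPU memory (the fast tier) and host RAM (the slow tier). The sketch assumes one cached state per user, demotion of the least recently used entry from the fast tier to the slow tier, and promotion back to the fast tier on a slow-tier hit.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy two-tier LRU cache: a small fast tier backed by a larger slow tier.

    Illustrative only. The fast tier stands in for GPU memory and the slow
    tier for host RAM; no real device transfers happen here.
    """

    def __init__(self, fast_capacity, slow_capacity):
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity
        self.fast = OrderedDict()  # hypothetical stand-in for GPU memory
        self.slow = OrderedDict()  # hypothetical stand-in for host RAM

    def get(self, user_id):
        """Return (state, tier), where tier is 'fast', 'slow', or 'miss'."""
        if user_id in self.fast:
            self.fast.move_to_end(user_id)  # mark as most recently used
            return self.fast[user_id], "fast"
        if user_id in self.slow:
            state = self.slow.pop(user_id)
            self._put_fast(user_id, state)  # promote on a slow-tier hit
            return state, "slow"
        return None, "miss"

    def put(self, user_id, state):
        """Insert or update a user's cached state in the fast tier."""
        self.slow.pop(user_id, None)  # avoid holding a stale copy
        self._put_fast(user_id, state)

    def _put_fast(self, user_id, state):
        self.fast[user_id] = state
        self.fast.move_to_end(user_id)
        while len(self.fast) > self.fast_capacity:
            # Demote the least recently used entry instead of dropping it.
            old_id, old_state = self.fast.popitem(last=False)
            self._put_slow(old_id, old_state)

    def _put_slow(self, user_id, state):
        self.slow[user_id] = state
        self.slow.move_to_end(user_id)
        while len(self.slow) > self.slow_capacity:
            self.slow.popitem(last=False)  # evicted entirely; must be recomputed
```

In this sketch a "miss" means the user's history would have to be re-encoded from scratch, while a "slow" hit only pays the cost of moving the state between tiers, which is the I/O gap the abstract refers to.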