AI RESEARCH

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

arXiv CS.LG

arXiv:2605.04450v1 Announce Type: cross Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating memory to one improves its efficiency but degrades the other. Existing systems optimize them in isolation, overlooking that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, leaving 20-30% latency improvement unrealized.
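To make the trade-off concrete, the following is a minimal sketch of why a single fixed split can be suboptimal. It is not the paper's method: the quadratic penalty model, the weights, and the names `toy_latency` and `best_emb_ratio` are all hypothetical, chosen only to show how the best EMB share of HBM moves when the workload regime changes.

```python
def toy_latency(emb_ratio, emb_weight, kv_weight):
    """Hypothetical latency model (an assumption, not from the paper).

    emb_ratio is the fraction of the shared HBM pool given to the embedding
    hot cache; the KV cache gets the remainder. Each cache contributes a
    penalty that shrinks as its own share grows.
    """
    kv_ratio = 1.0 - emb_ratio
    emb_penalty = emb_weight * (1.0 - emb_ratio) ** 2  # embedding-miss cost
    kv_penalty = kv_weight * (1.0 - kv_ratio) ** 2     # KV-miss cost
    return emb_penalty + kv_penalty


def best_emb_ratio(emb_weight, kv_weight, steps=100):
    """Grid-search the EMB share that minimizes the toy latency."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda r: toy_latency(r, emb_weight, kv_weight))


# Two illustrative regimes with made-up weights.
emb_heavy = best_emb_ratio(emb_weight=3.0, kv_weight=1.0)  # favors EMB cache
kv_heavy = best_emb_ratio(emb_weight=1.0, kv_weight=3.0)   # favors KV cache
```

Under these assumed weights the best EMB share is 0.75 in the embedding-heavy regime and 0.25 in the KV-heavy one, so a static 0.5 split is worse than the regime-specific optimum in both cases. The real latency curves would have to be measured per workload.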