Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

arXiv:2509.20979v2

In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through predictor design, but they often follow learned predictions blindly, which makes performance unreliable when predictions are inaccurate.
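To make the gap between LRU and the offline optimum concrete, the following is a minimal sketch and not code from the paper. It assumes unit-size items and uniform miss cost, under which Belady's rule (evict the item whose next use is farthest in the future) gives the offline optimum. The function names `lru_hits` and `belady_hits` and the toy trace are illustrative assumptions.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count hits for an LRU cache holding `capacity` unit-size items."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits


def belady_hits(trace, capacity):
    """Count hits for Belady's offline policy (needs the full trace up front)."""
    # next_use[i] = index of the next access to trace[i], or infinity if none.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> index of its next access
    hits = 0
    for i, key in enumerate(trace):
        if key in cache:
            hits += 1
        elif len(cache) >= capacity:
            # Evict the cached key whose next access is farthest away.
            victim = max(cache, key=cache.get)
            del cache[victim]
        cache[key] = next_use[i]
    return hits


if __name__ == "__main__":
    # Toy trace (assumption): a cyclic scan over 4 keys with room for only 3.
    trace = list("ABCD") * 3
    print("LRU hits:   ", lru_hits(trace, 3))
    print("Belady hits:", belady_hits(trace, 3))
```

On this cyclic trace of 12 accesses, LRU always evicts the key that is needed next and scores 0 hits, while the offline policy scores 6. A learned policy aims to approach the offline result by predicting future accesses, which is why following inaccurate predictions blindly can be costly.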