LiteCache: A Query Similarity-Driven, GPU-Centric KVCache Subsystem for Efficient LLM Inference

ArXi:2511.14510v2 Announce Type: replace During LLM inference, KVCache memory usage grows linearly with sequence length and batch size and often exceeds GPU capacity. Recent proposals offload KV states to host memory and reduce transfers using top-k attention. But their CPU-centric management of the on-GPU cache and CPU-GPU data movement incurs high overhead and fragments the bulk GPU execution that CUDA Graph relies on. To close this gap, we observe that adjacent queries within the same attention head exhibit strong directional similarity and retrieve highly overlapping top-k KV states.