GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

arXiv:2605.00831v1 Announce Type: cross The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services. The long-running nature of these tasks increases their susceptibility to hardware and software faults, leading to costly job failures, wasted resources, and degraded user experience. The stateful key-value (KV) cache, which grows with the sequence length, presents a central challenge as it is a critical and vulnerable component in distributed serving systems.