AI RESEARCH
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
arXiv cs.LG
arXiv:2604.19157v1 Announce Type: new KV-cache memory is a major bottleneck in real-world LLM serving, where systems must simultaneously support latency-sensitive small-batch requests and high-throughput concurrent workloads. Although many KV-cache compression methods improve offline accuracy or compression ratio, they often violate practical serving constraints such as paged memory layouts, regular memory access, and fused attention execution, which limits their effectiveness in deployment.