AI RESEARCH

MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression

arXiv cs.LG

arXiv:2604.17695v1

KV cache memory is the dominant bottleneck for long-context LLM inference. Existing compression methods each act on a single axis of the four-dimensional KV tensor -- token eviction (sequence), quantization (precision), low-rank projection (head dimension), or cross-layer sharing -- but apply the same recipe to every layer. We show that this homogeneity leaves accuracy on the table: different layers respond very differently to each compression operation, and the optimal per-layer mix of eviction and quantization is far from uniform.
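
To make the trade-off between the eviction and quantization axes concrete, here is a minimal Python sketch of the memory arithmetic only. It is not the paper's routing method. The names `LayerPlan`, `layer_bytes`, and `cache_bytes`, the tensor sizes, and the per-layer settings are all hypothetical and chosen for illustration. It shows that a uniform recipe and a non-uniform per-layer mix can occupy exactly the same memory budget, so the choice between them can be made on accuracy alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerPlan:
    """Hypothetical per-layer compression setting (illustrative only)."""
    keep_ratio: float  # fraction of tokens kept after eviction (sequence axis)
    bits: int          # bits per stored element after quantization (precision axis)


def layer_bytes(plan: LayerPlan, seq_len: int, num_heads: int, head_dim: int) -> int:
    """Bytes needed to store one layer's keys and values under `plan`."""
    kept_tokens = int(seq_len * plan.keep_ratio)
    elements = 2 * kept_tokens * num_heads * head_dim  # keys + values
    return elements * plan.bits // 8


def cache_bytes(plans, seq_len: int, num_heads: int, head_dim: int) -> int:
    """Total KV cache size across all layers."""
    return sum(layer_bytes(p, seq_len, num_heads, head_dim) for p in plans)


# Assumed toy dimensions, not taken from the paper.
SEQ_LEN, NUM_HEADS, HEAD_DIM = 4096, 8, 128

# Same recipe for every layer: keep half the tokens at 8 bits.
uniform = [LayerPlan(0.5, 8)] * 4

# Non-uniform mix: one layer keeps every token at 4 bits,
# another keeps a quarter of the tokens at 16 bits.
mixed = [
    LayerPlan(1.0, 4),
    LayerPlan(0.25, 16),
    LayerPlan(0.5, 8),
    LayerPlan(0.5, 8),
]

uniform_total = cache_bytes(uniform, SEQ_LEN, NUM_HEADS, HEAD_DIM)
mixed_total = cache_bytes(mixed, SEQ_LEN, NUM_HEADS, HEAD_DIM)
print(uniform_total, mixed_total)
```

Both plans come out to the same number of bytes. Which layers tolerate aggressive eviction and which tolerate low precision is the empirical question the abstract raises, and this sketch does not attempt to answer it.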