OjaKV: Context-Aware Online Low-Rank KV Cache Compression

arXiv:2509.21623v2 Announce Type: replace-cross The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a __TECH_PRESERVE_0TECH_PRESERVE_2__ processing a 32K-token prompt at a batch size of 4 requires approximately 16 GB for its KV cache, which exceeds the size of the model's weights.
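The arithmetic behind a figure of this size can be sketched as follows. This is a rough illustration, not the authors' calculation: the layer count, number of KV heads, head dimension, and 16-bit element size are assumed values for a hypothetical grouped-query-attention model, chosen only because they produce a cache of roughly the quoted size. The function name `kv_cache_bytes` is likewise hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Each layer stores one key tensor and one value tensor (hence the
    factor of 2), each of shape (batch, kv_heads, seq_len, head_dim).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed configuration (not taken from the source): 32 layers,
# 8 KV heads, head dimension 128, 16-bit (2-byte) cache entries.
size = kv_cache_bytes(
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    seq_len=32 * 1024,   # 32K-token prompt
    batch_size=4,
)
size_gib = size / 2**30
print(f"KV cache: {size_gib:.1f} GiB")
```

Under these assumptions the cache grows linearly in both sequence length and batch size, which is why long prompts at even modest batch sizes can rival the memory taken by the weights themselves.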