TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks

arXiv:2605.17170v1 Agentic workloads have emerged as a major class of LLM inference workload. They differ significantly from chat-only workloads: they require long-context processing, multimodal input handling, and structured multi-turn interactions with tool calling. As a result, their context has structure, and its tokens can differ in importance along three key axes: temporal recency relative to the current turn, modality (such as text or image tokens), and semantic role (such as user queries, tool calls, observations, or reasoning).
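To make the three axes concrete, the following minimal sketch shows one way an agentic context could be annotated and summarized per axis. It is purely illustrative: the names `TokenSpan`, `Modality`, `Role`, and `tokens_per_axis`, as well as the example token counts, are assumptions introduced here and are not taken from the paper, and the sketch does not attempt to reproduce any quantization policy.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"


class Role(Enum):
    USER_QUERY = "user_query"
    TOOL_CALL = "tool_call"
    OBSERVATION = "observation"
    REASONING = "reasoning"


@dataclass(frozen=True)
class TokenSpan:
    """A contiguous run of context tokens sharing the same labels (hypothetical type)."""
    turn: int            # index of the turn that produced the span
    num_tokens: int      # number of tokens (i.e. KV-cache entries) in the span
    modality: Modality
    role: Role

    def recency(self, current_turn: int) -> int:
        """Distance in turns from the current turn (0 = current turn)."""
        return current_turn - self.turn


def tokens_per_axis(spans, current_turn):
    """Count tokens along each of the three axes: recency, modality, role."""
    by_recency, by_modality, by_role = Counter(), Counter(), Counter()
    for span in spans:
        by_recency[span.recency(current_turn)] += span.num_tokens
        by_modality[span.modality] += span.num_tokens
        by_role[span.role] += span.num_tokens
    return by_recency, by_modality, by_role


# Made-up example context spanning three turns (token counts are illustrative).
context = [
    TokenSpan(turn=0, num_tokens=12, modality=Modality.TEXT, role=Role.USER_QUERY),
    TokenSpan(turn=0, num_tokens=40, modality=Modality.TEXT, role=Role.REASONING),
    TokenSpan(turn=0, num_tokens=8, modality=Modality.TEXT, role=Role.TOOL_CALL),
    TokenSpan(turn=1, num_tokens=576, modality=Modality.IMAGE, role=Role.OBSERVATION),
    TokenSpan(turn=2, num_tokens=15, modality=Modality.TEXT, role=Role.USER_QUERY),
]

by_recency, by_modality, by_role = tokens_per_axis(context, current_turn=2)
```

In this made-up context, a single image observation accounts for most of the tokens, which suggests why per-axis labels could be a useful input when deciding how to treat different parts of the KV cache.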