Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

arXiv:2603.04427v2 Announce Type: replace-cross Standard transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns) -- far fewer than value transfer needs.
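
To make the selection versus value-transfer split concrete, here is a minimal NumPy sketch of single-head causal attention in which queries and keys are projected to a narrow width while values keep the full width. It is an illustration under stated assumptions, not the paper's implementation: the function names, the dimensions (`d_select = 8`, `d_value = 64`), and the random weights are all hypothetical, and the excerpt does not say how the thin projections are obtained or trained.

```python
import numpy as np

def thin_key_attention(x, w_q, w_k, w_v):
    """Single-head causal attention with separate selection and value widths.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    q = x @ w_q  # (N, d_select): thin queries
    k = x @ w_k  # (N, d_select): thin keys, the part of the cache that shrinks
    v = x @ w_v  # (N, d_value): full-width values
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (N, d_value)

def kv_cache_floats_per_token(d_select, d_value):
    """Floats cached per token for one head: one key row plus one value row."""
    return d_select + d_value

# Assumed sizes, chosen only to make the arithmetic easy to follow.
n_tokens, d_model, d_select, d_value = 16, 64, 8, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n_tokens, d_model))
w_q = rng.standard_normal((d_model, d_select)) / np.sqrt(d_model)
w_k = rng.standard_normal((d_model, d_select)) / np.sqrt(d_model)
w_v = rng.standard_normal((d_model, d_value)) / np.sqrt(d_model)

out = thin_key_attention(x, w_q, w_k, w_v)
thin = kv_cache_floats_per_token(d_select, d_value)
full = kv_cache_floats_per_token(d_value, d_value)
```

With these assumed sizes, the per-token, per-head cache holds `d_select + d_value = 72` floats instead of `2 * d_value = 128`, while the output keeps the full value width. The saving comes entirely from the key side, which matches the abstract's framing that selection needs far fewer dimensions than value transfer.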