AI RESEARCH

Nectar: Neural Estimation of Cached-Token Attention via Regression

arXiv CS.LG

arXiv:2605.09778v1 Announce Type: new Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer.
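
To make the two regression targets concrete, here is a minimal NumPy sketch of the exact quantities such networks would be fit to, for a single KV-head with a fixed cache. It is an illustration under stated assumptions, not the paper's implementation: the function names (`attention_with_lognorm`, `make_regression_targets`, `merge_partial_attention`), the 1/sqrt(d) scaling, and the random data are all hypothetical. The merge step is likewise an assumption about why a log-normalizer might be useful (it lets attention over the fixed context be combined with attention over other tokens); the abstract does not state this.

```python
import numpy as np


def attention_with_lognorm(q, K, V):
    """Exact softmax attention of one query over a fixed key-value cache.

    Returns the attention output and the log-normalizer
    log(sum_i exp(q . k_i / sqrt(d))). Both depend only on q once K, V are fixed.
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)           # one score per cached key, shape (n,)
    m = scores.max()                      # subtract the max for numerical stability
    w = np.exp(scores - m)
    log_z = m + np.log(w.sum())           # log-normalizer (score-network target)
    out = (w / w.sum()) @ V               # attention output (target-network target)
    return out, log_z


def make_regression_targets(queries, K, V):
    """Build (output, log-normalizer) targets for a batch of sampled queries."""
    outs, log_zs = zip(*(attention_with_lognorm(q, K, V) for q in queries))
    return np.stack(outs), np.array(log_zs)


def merge_partial_attention(out_a, log_z_a, out_b, log_z_b):
    """Combine attention over two disjoint key sets via their log-normalizers.

    Assumption for illustration: this is the standard softmax identity, shown
    here only to motivate predicting the log-normalizer.
    """
    log_z = np.logaddexp(log_z_a, log_z_b)
    out = np.exp(log_z_a - log_z) * out_a + np.exp(log_z_b - log_z) * out_b
    return out, log_z


# Hypothetical toy data: a cache of 512 tokens with head dimension 16.
rng = np.random.default_rng(0)
n, d = 512, 16
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
queries = rng.standard_normal((8, d))     # stand-in for a task-relevant query distribution

target_out, target_log_z = make_regression_targets(queries, K, V)

# Sanity check of the merge identity: attention over two halves of the cache,
# combined with their log-normalizers, matches attention over the whole cache.
q = queries[0]
full_out, full_log_z = attention_with_lognorm(q, K, V)
out_a, lz_a = attention_with_lognorm(q, K[: n // 2], V[: n // 2])
out_b, lz_b = attention_with_lognorm(q, K[n // 2 :], V[n // 2 :])
merged_out, merged_log_z = merge_partial_attention(out_a, lz_a, out_b, lz_b)
```

In this sketch, `target_out` and `target_log_z` are the supervised targets a pair of regressors would be trained on, one pair per layer and KV-head. Computing them requires reading the whole cache, which is the cost a fitted network would be meant to avoid at inference time.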