Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

arXiv:2604.15153v1 Announce Type: cross

Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, because full self-attention scales quadratically with input length. Token compression addresses this challenge by reducing the number of tokens used to represent the input. However, existing prompt-compression approaches operate primarily in token space and overlook inefficiencies in the latent embedding space.
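
To make the quadratic-scaling argument concrete, the following is a minimal sketch of the arithmetic only. It assumes a hypothetical grouping factor `k` (the name is borrowed from the title) under which every `k` consecutive tokens are replaced by one representation. The abstract does not describe how the merging itself is performed, so the function names and the grouping rule here are illustrative assumptions, not the paper's method.

```python
import math


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over one sequence."""
    return num_tokens * num_tokens


def compressed_length(num_tokens: int, k: int) -> int:
    """Length after replacing each group of k tokens with one entry.

    Assumption for illustration: groups are consecutive and the last
    group may be shorter than k.
    """
    return math.ceil(num_tokens / k)


n, k = 4096, 4
m = compressed_length(n, k)
# Shortening the sequence by a factor of k cuts the pair count by about k**2.
print(m, attention_pairs(n) // attention_pairs(m))  # prints "1024 16"
```

Under these assumptions, a 4x shorter sequence needs roughly 16x fewer attention scores, which is why reducing sequence length is attractive for long prompts.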