Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

arXiv:2604.02985v1 Announce Type: cross With the wide adoption of language models for information retrieval (IR), and specifically for retrieval-augmented generation (RAG) systems, the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and therefore to increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models.