TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

arXiv:2502.20969v4

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, creating a significant system challenge: achieving high throughput and low latency is difficult, especially when GPU memory is limited. To address these challenges, we propose TeleRAG, an efficient inference system that reduces latency and improves throughput with minimal GPU memory requirements.