AI RESEARCH
Stream2LLM: Overlap Context Streaming and Prefill for Reduced TTFT
arXiv CS.AI
arXiv:2604.16395v1 Announce Type: cross
Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Recent work mitigates this via streaming, which overlaps retrieval with inference, but prior systems focus on single-request settings and overlook the challenges of multi-tenant deployments, where concurrent requests contend for GPU memory and scheduling must adapt to dynamic context arrivals.
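The latency tension described above can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the paper's system: the function names, the chunk arrival times, and the per-chunk prefill costs are all hypothetical; chunks are assumed to arrive in order, a single request has the GPU to itself, and the cost of producing the first token after prefill is ignored.

```python
def ttft_wait_for_all(arrivals, prefill_costs):
    """Prefill starts only after the last context chunk has arrived."""
    return max(arrivals) + sum(prefill_costs)


def ttft_streamed(arrivals, prefill_costs):
    """Each chunk is prefilled once it has arrived and the GPU is free."""
    gpu_free_at = 0.0
    for arrival, cost in zip(arrivals, prefill_costs):
        # A chunk cannot start before it arrives or before the
        # previous chunk's prefill has finished.
        gpu_free_at = max(gpu_free_at, arrival) + cost
    return gpu_free_at


# Hypothetical numbers (seconds): four chunks arriving 0.5 s apart,
# each taking 0.25 s to prefill.
arrivals = [0.5, 1.0, 1.5, 2.0]
prefill_costs = [0.25, 0.25, 0.25, 0.25]

print(ttft_wait_for_all(arrivals, prefill_costs))  # 3.0
print(ttft_streamed(arrivals, prefill_costs))      # 2.25
```

In this toy setting, overlapping hides all but the last chunk's prefill behind retrieval. The model deliberately leaves out the multi-tenant effects the abstract highlights, such as GPU memory contention and scheduling across concurrent requests.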