SOMA: Efficient Multi-turn LLM Serving via Small Language Model

arXiv:2605.11317v1

Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large