AI RESEARCH
POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving
arXiv cs.LG
arXiv:2604.16583v1 Announce Type: new
Edge deployment of large language models (LLMs) increasingly relies on libraries of lightweight LoRA adapters, yet GPU/DRAM can keep only a small resident subset at a time. Serving a request through a non-resident adapter requires paging its weights from storage, incurring measurable latency. This creates a two-timescale online control problem: on a slow timescale, the system selects which adapters remain resident in fast memory, while on a fast timescale it routes each request to an adapter whose context-dependent utility is unknown a priori.
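
To make the two-timescale structure concrete, here is a minimal Python sketch. It is not the POLAR algorithm, which the abstract does not specify. It assumes a simple epsilon-greedy router with running-mean utility estimates, a fixed paging penalty for non-resident adapters, and a cache refresh every `epoch_len` requests. It also ignores request context, which the paper's setting does not. The class name, method names, and parameters are all hypothetical.

```python
import random


class TwoTimescaleController:
    """Illustrative only: epsilon-greedy routing plus periodic top-k caching.

    Assumption: utilities are context-free scalars, which is simpler than
    the context-dependent setting described in the abstract.
    """

    def __init__(self, num_adapters, cache_size, epoch_len, epsilon=0.1, seed=0):
        self.num_adapters = num_adapters
        self.cache_size = cache_size
        self.epoch_len = epoch_len      # requests between cache refreshes
        self.epsilon = epsilon          # exploration probability
        self.rng = random.Random(seed)
        self.counts = [0] * num_adapters
        self.means = [0.0] * num_adapters   # running mean utility per adapter
        self.resident = set(range(cache_size))  # arbitrary initial cache
        self.t = 0

    def route(self, paging_cost):
        """Fast timescale: pick an adapter for one request."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_adapters)

        def score(a):
            # Non-resident adapters pay the (assumed fixed) paging penalty.
            penalty = 0.0 if a in self.resident else paging_cost
            return self.means[a] - penalty

        return max(range(self.num_adapters), key=score)

    def observe(self, adapter, utility):
        """Update the utility estimate; refresh the cache at epoch ends."""
        self.counts[adapter] += 1
        self.means[adapter] += (utility - self.means[adapter]) / self.counts[adapter]
        self.t += 1
        if self.t % self.epoch_len == 0:
            self.refresh_cache()

    def refresh_cache(self):
        """Slow timescale: keep the top-k adapters by estimated utility."""
        ranked = sorted(range(self.num_adapters),
                        key=lambda a: self.means[a], reverse=True)
        self.resident = set(ranked[:self.cache_size])
```

In this sketch the router runs on every request, while the resident set changes only once per epoch, so paging decisions are amortized over many routing decisions.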