AI RESEARCH

Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching

arXiv cs.LG

arXiv:2602.22812v2 Announce Type: replace Since local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck, this paper proposes distributed prompt caching to enhance inference performance by cooperatively sharing intermediate processing states across multiple low-end edge devices. To fully utilize prompt similarity, our distributed caching mechanism also supports partial matching. As this approach