AI RESEARCH
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
arXiv CS.AI
arXiv:2508.16703v2 Announce Type: replace-cross
Running Large Language Models (LLMs) on-device is now a critical enabler for preserving user privacy. We observe that, in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because it is sensitive to quantization. This fallback degrades the user experience and makes system scheduling more complex.
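To make the quantization-sensitivity observation concrete, here is a toy sketch in plain Python. It is not the paper's method or measurement: the abstract does not say what causes the sensitivity, so the sketch assumes one commonly discussed cause, a single large activation outlier under symmetric per-tensor int8 quantization. All names and values (`quantize_int8`, `attention_weights`, the query and key numbers) are hypothetical and chosen only for illustration.

```python
import math

def quantize_int8(values):
    """Symmetric per-tensor int8 quantization (assumed scheme, for illustration)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Softmax over dot-product scores of one query against several keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

def weight_error(outlier):
    """Max change in attention weights when the query is quantized to int8.

    The last query channel carries `outlier`; the first three channels
    carry the small values that actually distinguish the keys.
    """
    query = [0.3, -0.2, 0.4, outlier]   # hypothetical toy values
    keys = [
        [1.0, 0.0, 0.0, 0.1],
        [0.0, 1.0, 0.0, 0.1],
        [0.0, 0.0, 1.0, 0.1],
    ]
    exact = attention_weights(query, keys)
    q_codes, scale = quantize_int8(query)
    approx = attention_weights(dequantize(q_codes, scale), keys)
    return max(abs(a - b) for a, b in zip(exact, approx))

# A mild last channel keeps the quantization step small; a large outlier
# inflates the shared scale and coarsens the informative channels.
err_small = weight_error(0.5)
err_outlier = weight_error(50.0)
print(f"max weight error, no outlier:   {err_small:.5f}")
print(f"max weight error, with outlier: {err_outlier:.5f}")
```

In this toy setup the outlier stretches the shared scale, so the small channels that separate the keys collapse onto only a few integer levels and the softmax weights shift. Frameworks that cannot tolerate such shifts would have a reason to keep attention in floating point off the NPU, which is consistent with the fallback the abstract describes, although the abstract itself gives no such detail.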