ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

arXiv:2508.16703v2 Announce Type: replace-cross Running Large Language Models (LLMs) on-device is now a critical enabler for preserving user privacy. We observe that, in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of quantization sensitivity. This fallback degrades the user experience and increases the complexity of system scheduling.