AI RESEARCH

Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

arXiv cs.LG

arXiv:2605.08913v1

Autoregressive inference is typically assumed to scale predictably with decoding length, and key-value (KV) caching is widely regarded as a universally beneficial optimization for accelerating decoding. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations.
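To make "non-monotonic latency" concrete, the following is a minimal sketch of how one might flag it in a set of timing measurements. It is not the authors' method: the function name `find_non_monotonic_steps`, the `tolerance` parameter, and the sample numbers are all hypothetical and chosen only for illustration. The idea is simply that if latency scaled predictably with decoding length, a longer decode should never be noticeably faster than a shorter one.

```python
from typing import List, Tuple


def find_non_monotonic_steps(
    measurements: List[Tuple[int, float]],
    tolerance: float = 0.0,
) -> List[Tuple[int, int, float]]:
    """Find adjacent decode lengths where latency drops as length grows.

    measurements: (decode_length_in_tokens, latency_in_ms) pairs.
    tolerance: drops of this size (ms) or smaller are treated as noise.
    Returns (shorter_length, longer_length, drop_in_ms) for each violation.
    """
    ordered = sorted(measurements)  # sort by decode length
    violations = []
    for (n_prev, t_prev), (n_next, t_next) in zip(ordered, ordered[1:]):
        drop = t_prev - t_next
        if drop > tolerance:
            violations.append((n_prev, n_next, drop))
    return violations


# Hypothetical latencies, invented for illustration (not data from the paper).
samples = [(64, 410.0), (128, 820.0), (192, 640.0), (256, 1310.0)]
violations = find_non_monotonic_steps(samples, tolerance=10.0)
print(violations)
```

In practice the input pairs would come from repeated, synchronized timing runs on the device, with the tolerance set from the observed run-to-run variance so that ordinary noise is not reported as a regime change.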