AI RESEARCH
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
arXiv CS.LG
arXiv:2605.08913v1. Autoregressive inference is typically assumed to scale predictably with decoding length, and key-value (KV) caching is widely regarded as a universally beneficial optimization for accelerating decoding. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations.
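To make "non-monotonic latency" concrete, the following is a minimal sketch of how such behavior could be detected in a latency sweep. It is not the paper's measurement code. The names `measure_latency`, `find_non_monotonic`, `decode_fn`, and `rel_tol` are hypothetical, and `decode_fn` stands in for any callable that runs a decode of a given length (with or without a KV cache).

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple


def measure_latency(
    decode_fn: Callable[[int], object],  # hypothetical: runs one decode of n tokens
    lengths: Sequence[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Return the median wall-clock latency in seconds for each decoding length."""
    results: Dict[int, float] = {}
    for n in lengths:
        samples: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            decode_fn(n)
            samples.append(time.perf_counter() - start)
        # Median is less sensitive to one-off stalls than the mean.
        results[n] = sorted(samples)[len(samples) // 2]
    return results


def find_non_monotonic(
    latency: Dict[int, float],
    rel_tol: float = 0.05,  # assumed noise margin; tune per setup
) -> List[Tuple[int, int]]:
    """Return adjacent (shorter, longer) length pairs where the longer
    decode was faster than the shorter one by more than rel_tol."""
    lengths = sorted(latency)
    drops: List[Tuple[int, int]] = []
    for shorter, longer in zip(lengths, lengths[1:]):
        if latency[longer] < latency[shorter] * (1.0 - rel_tol):
            drops.append((shorter, longer))
    return drops
```

For example, a sweep of `{16: 1.0, 32: 2.0, 48: 1.2, 64: 2.5}` seconds would flag the pair `(32, 48)`, since decoding 48 tokens was measurably faster than decoding 32. On an asynchronous GPU backend such as MPS, `decode_fn` would also need to wait for queued work to finish before returning (in PyTorch, for instance, by calling `torch.mps.synchronize()`), otherwise the timer measures only kernel launch time.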