AI RESEARCH

From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill

arXiv cs.LG

arXiv:2510.08055v2

Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-tokens (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits the processing of long prompts along the token dimension and interleaves prefill with ongoing decode iterations.
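
To make the idea of chunked prefill concrete, here is a minimal sketch of a token-budgeted scheduler. It is an illustration only, not the paper's implementation or any serving system's API: the `Request` class, the `chunked_prefill_schedule` function, and the `token_budget` and `max_new_tokens` parameters are all hypothetical names, and the policy (one token per running decode request, with the rest of the budget given to the next prompt chunk) is an assumed simplification.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    name: str
    prompt_len: int      # total prompt tokens that need prefill
    prefilled: int = 0   # prompt tokens already processed
    decoded: int = 0     # output tokens generated so far


def chunked_prefill_schedule(prefill_req, decode_reqs, token_budget, max_new_tokens):
    """Plan iterations that interleave one long prefill with ongoing decodes.

    Assumed policy: every iteration serves one token for each running decode
    request, then spends the remaining token budget on the next chunk of the
    prompt. The prompt is therefore split along the token dimension, and
    decode requests never wait for the whole prompt to finish.
    """
    iterations = []
    while prefill_req.prefilled < prefill_req.prompt_len:
        # Decode requests that still have tokens left to generate.
        active = [r for r in decode_reqs if r.decoded < max_new_tokens]
        remaining = token_budget - len(active)
        if remaining <= 0:
            raise ValueError("token_budget must exceed the number of decode requests")
        for r in active:
            r.decoded += 1  # one decode step per running request
        # Prefill chunk is capped by the leftover budget and the leftover prompt.
        chunk = min(remaining, prefill_req.prompt_len - prefill_req.prefilled)
        prefill_req.prefilled += chunk
        iterations.append({"decode": [r.name for r in active], "prefill_chunk": chunk})
    return iterations


if __name__ == "__main__":
    long_prompt = Request("long", prompt_len=10)
    running = [Request("a", prompt_len=0), Request("b", prompt_len=0)]
    plan = chunked_prefill_schedule(long_prompt, running, token_budget=6, max_new_tokens=100)
    print([step["prefill_chunk"] for step in plan])  # [4, 4, 2]
```

With a budget of 6 tokens per iteration and two running decodes, the 10-token prompt is processed in chunks of 4, 4, and 2 tokens, and both decode requests advance by one token in every iteration instead of stalling until the prefill completes.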