AI RESEARCH
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
arXiv CS.LG
arXiv:2510.08055v2

Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-tokens (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits the processing of long prompts along the token dimension and interleaves prefill with ongoing decode iterations.
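
To make the idea of chunked prefill concrete, here is a minimal toy sketch of one scheduling iteration. It is not taken from the paper or from any particular serving system: the `Request` class, the `schedule_iteration` function, and the per-iteration `token_budget` policy are all hypothetical names and assumptions chosen for illustration. Each ongoing decode request gets one token per iteration, and whatever budget is left goes to a chunk of a pending prompt, so decoding is never stalled by a long prefill.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record for a request whose prompt is still being prefilled."""
    rid: str
    prompt_len: int      # total number of prompt tokens
    prefilled: int = 0   # prompt tokens already processed


def schedule_iteration(decoding, prefilling, token_budget):
    """Build one batch under an assumed per-iteration token budget.

    Every decoding request contributes exactly one token. The leftover
    budget is spent on prompt chunks, splitting long prompts along the
    token dimension across several iterations.
    """
    batch = [(rid, "decode", 1) for rid in decoding]
    remaining = token_budget - len(batch)
    for req in prefilling:
        if remaining <= 0:
            break
        chunk = min(remaining, req.prompt_len - req.prefilled)
        if chunk > 0:
            batch.append((req.rid, "prefill", chunk))
            req.prefilled += chunk
            remaining -= chunk
    return batch


if __name__ == "__main__":
    decoding = ["a", "b"]               # two requests already generating tokens
    pending = [Request("p", prompt_len=10)]
    while pending[0].prefilled < pending[0].prompt_len:
        print(schedule_iteration(decoding, pending, token_budget=6))
```

With a budget of 6 tokens and two decoding requests, the 10-token prompt is processed in chunks of at most 4 tokens per iteration, while both decode requests keep producing a token every iteration. A real scheduler would also handle request admission, KV-cache memory limits, and completion, which this sketch deliberately leaves out.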