SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

arXiv CS.AI

arXiv:2605.00528v1 Announce Type: cross

AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit.
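The contrast between request-level and program-level scheduling can be illustrated with a toy model. The sketch below is not SAGA's algorithm; it is a minimal illustration under stated assumptions: a hypothetical `Worker` keeps per-workflow intermediate state, a call reuses that state only if the same worker served the previous step, and "request-level" scheduling is modeled as plain round-robin. All names (`Worker`, `request_level`, `workflow_level`, `count_state_rebuilds`) are invented for this example.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Worker:
    """Hypothetical GPU worker holding intermediate state per workflow."""
    name: str
    # Maps workflow id -> index of the last step this worker served for it.
    state: dict = field(default_factory=dict)


def make_workers(n):
    """Create n fresh workers with empty state."""
    return [Worker(f"gpu{i}") for i in range(n)]


def request_level(workers):
    """Assumed baseline: every call is independent and dispatched round-robin."""
    rr = cycle(workers)
    return lambda workflow_id: next(rr)


def workflow_level(workers):
    """The workflow is the schedulable unit: all of its calls share one worker."""
    assignment = {}

    def pick(workflow_id):
        if workflow_id not in assignment:
            # Place each new workflow on the next worker in turn, then pin it.
            assignment[workflow_id] = workers[len(assignment) % len(workers)]
        return assignment[workflow_id]

    return pick


def count_state_rebuilds(calls, pick):
    """Replay (workflow_id, step) calls and count steps whose prior state is missing."""
    rebuilds = 0
    for workflow_id, step in calls:
        worker = pick(workflow_id)
        # State is reusable only if this worker served the immediately preceding step.
        if step > 0 and worker.state.get(workflow_id) != step - 1:
            rebuilds += 1
        worker.state[workflow_id] = step
    return rebuilds


# Two workflows, three chained steps each, arriving interleaved.
calls = [(wf, step) for step in range(3) for wf in ("A", "B")]

print("request-level rebuilds: ",
      count_state_rebuilds(calls, request_level(make_workers(3))))
print("workflow-level rebuilds:",
      count_state_rebuilds(calls, workflow_level(make_workers(3))))
```

In this toy setup the round-robin baseline loses the intermediate state on every follow-up step, while pinning each workflow to one worker loses none. A real scheduler would also have to handle load balancing, memory limits, and preemption, which this sketch deliberately ignores.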