HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

arXiv:2508.15919v3 Announce Type: replace-cross Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches either rely on static scheduling policies or focus on single-task settings, limiting their applicability in real-world deployments with heterogeneous requests, variable prompt lengths, and elastic scaling requirements.