Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs
arXiv cs.LG
arXiv:2603.22206v1. Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of the context for subsequent stages. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity