Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

arXiv:2605.04357v1 Announce Type: cross The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we