
Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints

arXiv CS.AI

arXiv:2603.26796v1 Announce Type: cross

We study the problem of routing queries to large language models (LLMs) under cost, GPU-resource, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes the model assignment for each batch while respecting cost and model-capacity limits. We further
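The abstract does not say how the joint optimization is solved. A minimal sketch follows. It assumes per-query quality and cost estimates for every model are already available, and it poses batch-level routing as choosing one model per query so that total predicted quality is maximized, subject to a total batch budget and a per-model capacity. The function names, the `quality`/`cost`/`capacity` inputs, and the use of exhaustive search are hypothetical illustrations, not the authors' framework. Exhaustive search is exponential in batch size, so it only fits toy batches. A per-query baseline is included to show how independent choices can overshoot a batch budget.

```python
from itertools import product
from typing import Optional, Sequence


def route_batch(
    quality: Sequence[Sequence[float]],  # hypothetical: quality[i][j] = predicted quality of model j on query i
    cost: Sequence[Sequence[float]],     # hypothetical: cost[i][j] = cost of sending query i to model j
    capacity: Sequence[int],             # hypothetical: capacity[j] = max queries model j accepts per batch
    budget: float,                       # total cost allowed for the whole batch
) -> Optional[tuple[int, ...]]:
    """Return one model index per query that maximizes total predicted quality
    under the batch budget and per-model capacity, or None if infeasible.
    Exhaustive search over all m**n assignments: a toy, not a scalable method."""
    n, m = len(quality), len(capacity)
    best: Optional[tuple[int, ...]] = None
    best_quality = float("-inf")
    for assign in product(range(m), repeat=n):
        # Enforce per-model capacity for this batch.
        counts = [0] * m
        for j in assign:
            counts[j] += 1
        if any(counts[j] > capacity[j] for j in range(m)):
            continue
        # Enforce the batch-level cost budget (the constraint per-query routing ignores).
        if sum(cost[i][j] for i, j in enumerate(assign)) > budget:
            continue
        total_quality = sum(quality[i][j] for i, j in enumerate(assign))
        if total_quality > best_quality:
            best, best_quality = assign, total_quality
    return best


def route_per_query(quality: Sequence[Sequence[float]]) -> tuple[int, ...]:
    """Baseline: each query independently picks its highest-quality model,
    ignoring batch cost and model capacity."""
    return tuple(max(range(len(row)), key=lambda j: row[j]) for row in quality)


# Toy batch: model 0 is small and cheap, model 1 is large and expensive.
quality = [[0.6, 0.9], [0.5, 0.95], [0.7, 0.8]]
cost = [[1, 5], [1, 5], [1, 5]]
capacity = [3, 1]  # the large model can serve only one query in this batch
budget = 8

print("per-query:", route_per_query(quality))
print("batch-level:", route_batch(quality, cost, capacity, budget))
```

In this toy batch, per-query routing sends all three queries to the large model, for a cost of 15. That exceeds both the budget and the large model's capacity. The batch-level search instead sends only the second query to the large model. A practical system would likely replace the enumeration with an integer-programming solver or a heuristic.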