Adapter-Augmented Bandits for Online Multi-Constrained Multi-Modal Inference Scheduling

arXiv:2603.06403v1

Multi-modal large language model (MLLM) inference scheduling enables strong response quality under practical, heterogeneous budgets, beyond what a homogeneous single-backend setting can offer. Yet online MLLM task scheduling is nontrivial: requests vary sharply in modality composition and latent reasoning difficulty, while execution backends incur distinct, time-varying costs due to system jitter and network variation.