Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

arXiv:2603.12707v1 Announce Type: cross

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution.
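
To give a rough feel for why the modality boundary might minimize transfer, here is a minimal back-of-envelope sketch. It is not the paper's cost model. It assumes a hypothetical two-device setup in which device A runs the vision encoder plus the first `k` language-model layers during prefill and device B runs everything else, including decoding. Under that assumption, a split at `k = 0` ships only the projected visual embeddings, while any `k > 0` ships the hidden states of the whole prompt plus the KV cache of the `k` layers that device B needs for decoding. The function name and all model dimensions are illustrative assumptions.

```python
def boundary_transfer_bytes(k, n_vis, n_text, d_model, n_kv_heads, head_dim,
                            bytes_per_elem=2):
    """Estimate bytes sent from device A to device B for a split after k LLM layers.

    Assumptions (illustrative, not taken from the paper):
      - k = 0 is the modality boundary: only visual embeddings cross devices,
        and text tokens are embedded on device B.
      - k > 0: device A prefills layers 1..k, so it must ship the hidden states
        of the full prompt plus the K and V tensors of those k layers.
    """
    if k == 0:
        return n_vis * d_model * bytes_per_elem
    seq_len = n_vis + n_text
    hidden_bytes = seq_len * d_model * bytes_per_elem
    # Factor 2 accounts for storing both keys and values per layer.
    kv_bytes = 2 * k * seq_len * n_kv_heads * head_dim * bytes_per_elem
    return hidden_bytes + kv_bytes


# Hypothetical model shape: 32 layers, 576 visual tokens, 64 text tokens, fp16.
shape = dict(n_vis=576, n_text=64, d_model=4096, n_kv_heads=32, head_dim=128)
costs = [boundary_transfer_bytes(k, **shape) for k in range(33)]
best_k = min(range(len(costs)), key=costs.__getitem__)
print(f"cheapest split: k={best_k}, {costs[best_k] / 2**20:.1f} MiB")
```

In this toy model the cost grows linearly with `k`, so the smallest transfer is at the modality boundary. Real systems may differ, for example with grouped-query attention, KV cache compression, or a vision projector whose output width is not `d_model`.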