Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines

arXiv:2604.16734v1 Announce Type: new Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding diverse visual inputs, including high-resolution images and long video sequences, and in generating responses from them. As these models scale to richer visual representations, inference increasingly relies on storing large numbers of vision tokens in the key-value (KV) cache, making memory consumption a central bottleneck.
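
To make the scale of this bottleneck concrete, the following is a rough back-of-the-envelope sketch of how KV cache size grows with the number of cached vision tokens. It uses the standard estimate for a decoder-only transformer (one key and one value vector per token, per layer, per KV head). The function name and every model dimension below (32 layers, 8 KV heads, head dimension 128, 16-bit values, 64 frames at 256 tokens each) are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both a key and a value
    vector for every token, in every layer, for every KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical video input: 64 frames, 256 vision tokens per frame.
vision_tokens = 64 * 256

# Hypothetical model configuration (assumed, not from the paper).
size = kv_cache_bytes(vision_tokens, num_layers=32, num_kv_heads=8, head_dim=128)

print(f"{vision_tokens} vision tokens -> {size / 2**30:.2f} GiB of KV cache")
# prints "16384 vision tokens -> 2.00 GiB of KV cache"
```

Under these assumed dimensions each cached token costs 128 KiB, so the cache grows linearly with the vision token count, and longer videos or higher-resolution images scale the footprint proportionally.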