Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

arXiv:2603.23914v1 Announce Type: cross Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge because of the memory overhead incurred during decoding, especially when queries and answers consist of long sequences of visual and text tokens.