Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

arXiv:2604.04579v1 Announce Type: new Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency.