Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

arXiv:2505.17862v2 Announce Type: replace Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. We