Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task

arXiv:2603.15467v1 Announce Type: new Multimodal Large Language Models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments largely focus on 2D or 3D visual context and vision-language tasks, offering limited support for temporally dependent auditory signals and selective cross-modal integration, where different modalities may provide complementary or interfering information. Both are essential capabilities for realistic multimodal reasoning.