SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

arXiv:2605.08412v1 Announce Type: new Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, yet their ability to reason across multiple independent video streams remains poorly understood. Existing multi-video benchmarks rely largely on human-annotated real-world footage, limiting the precision of spatial, temporal, and physical ground truth and making it difficult to diagnose model failures.