SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

arXiv:2604.09037v1 Announce Type: cross Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We