CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

arXiv:2605.11723v1 Announce Type: cross In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning.
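
The three inference stages described above (temporal scan, spatial grounding, structured judgment) can be illustrated with a toy sketch. This is not the CaC implementation: the abstract does not specify the model interface, so the sketch replaces the Vision-Language Model with precomputed per-cell anomaly scores. All names here (`Frame`, `temporal_scan`, `spatial_ground`, `judge`, `window`, `threshold`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: each frame is a grid of anomaly scores in [0, 1].
# In a real system these would come from a model; here they are plain floats.
Frame = List[List[float]]


@dataclass
class Judgment:
    window: Tuple[int, int]     # [start, end) frame indices of the anchored interval
    cell: Tuple[int, int, int]  # (frame index, row, col) of the grounded region
    score: float                # peak score inside the interval
    anomalous: bool             # final decision
    rationale: List[str]        # ordered reasoning steps, one per stage


def frame_score(frame: Frame) -> float:
    """Mean score over all cells of one frame."""
    cells = [v for row in frame for v in row]
    return sum(cells) / len(cells)


def temporal_scan(frames: List[Frame], window: int) -> Tuple[int, int]:
    """Coarse stage: pick the fixed-length window with the highest mean frame score."""
    window = max(1, min(window, len(frames)))
    scores = [frame_score(f) for f in frames]
    best_start, best_mean = 0, float("-inf")
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean > best_mean:
            best_start, best_mean = start, mean
    return best_start, best_start + window


def spatial_ground(frames: List[Frame], span: Tuple[int, int]) -> Tuple[Tuple[int, int, int], float]:
    """Fine stage: find the highest-scoring cell, searching only inside the window."""
    best_cell, best_val = (span[0], 0, 0), float("-inf")
    for t in range(span[0], span[1]):
        for r, row in enumerate(frames[t]):
            for c, val in enumerate(row):
                if val > best_val:
                    best_cell, best_val = (t, r, c), val
    return best_cell, best_val


def judge(frames: List[Frame], window: int = 2, threshold: float = 0.5) -> Judgment:
    """Run both stages, then record the steps as a structured rationale."""
    if not frames:
        raise ValueError("frames must be non-empty")
    span = temporal_scan(frames, window)
    cell, peak = spatial_ground(frames, span)
    anomalous = peak >= threshold
    rationale = [
        f"temporal: frames [{span[0]}, {span[1]}) have the highest mean score",
        f"spatial: peak at frame {cell[0]}, row {cell[1]}, col {cell[2]}",
        f"decision: peak {peak:.2f} vs threshold {threshold:.2f}",
    ]
    return Judgment(span, cell, peak, anomalous, rationale)
```

The point of the sketch is the ordering: the spatial search only visits frames inside the window chosen by the temporal stage, which is the coarse-to-fine structure the abstract describes. The fixed window length and the threshold rule are simplifying assumptions, not details taken from the paper.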