Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models

arXiv:2605.02912v1 Announce Type: cross Video Anomaly Detection (VAD) has traditionally been framed as binary classification or outlier detection, providing neither interpretable reasoning nor precise spatial localization of anomalous events. While Vision-Language Models (VLMs) offer rich scene understanding, they struggle with reliable spatial grounding, often producing hallucinated or geometrically invalid bounding boxes when asked to localize objects.
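
To make "geometrically invalid" concrete, the following is a minimal sketch of a sanity check one might run on a box emitted by a VLM. It is not taken from the paper: the function name `is_valid_box`, the `(x1, y1, x2, y2)` pixel-coordinate convention, and the frame size arguments are all assumptions chosen for illustration. A check like this only catches malformed geometry; it cannot tell whether a well-formed box is hallucinated.

```python
from typing import Sequence


def is_valid_box(box: Sequence[float], width: int, height: int) -> bool:
    """Return True if box is a non-degenerate (x1, y1, x2, y2) box inside the frame.

    Hypothetical helper: the coordinate convention is an assumption,
    not something specified by the source abstract.
    """
    # A box must have exactly four coordinates.
    if len(box) != 4:
        return False
    x1, y1, x2, y2 = box
    # Corners must be ordered so that the box has positive width and height.
    if x2 <= x1 or y2 <= y1:
        return False
    # Every coordinate must lie within the frame bounds.
    return 0 <= x1 and 0 <= y1 and x2 <= width and y2 <= height
```

For a 640x480 frame, `is_valid_box((10, 20, 110, 220), 640, 480)` passes, while a box with swapped corners or one extending past the frame edge is rejected.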