CircuitProbe: Tracing Visual Temporal Evidence Flow in Video Language Models

arXiv:2507.19420v2 Announce Type: replace-cross Autoregressive large vision-language models (LVLMs) interface video and language by projecting video features into the LLM's embedding space as continuous visual token embeddings. However, it remains unclear where temporal evidence is represented and how it causally influences decoding.
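
To make the projection step concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a single linear projector (one common choice; the abstract does not say which projector is used), toy dimensions, and random values in place of real encoder outputs. The names `project_video_features` and `build_input_sequence` are hypothetical.

```python
import random


def project_video_features(features, weight, bias):
    """Map per-frame vision features into the LLM embedding space.

    Assumption: a plain linear layer (features @ weight + bias).
    features: list of vectors, each of length d_vis
    weight:   d_vis x d_llm matrix as nested lists
    bias:     vector of length d_llm
    """
    projected = []
    for vec in features:
        out = [
            sum(v * weight[i][j] for i, v in enumerate(vec)) + bias[j]
            for j in range(len(bias))
        ]
        projected.append(out)
    return projected


def build_input_sequence(visual_tokens, text_embeddings):
    """Prepend the continuous visual token embeddings to the text embeddings."""
    return visual_tokens + text_embeddings


# Toy sizes, chosen only for illustration.
random.seed(0)
D_VIS, D_LLM, N_FRAMES, N_TEXT = 4, 6, 3, 2

video_features = [[random.gauss(0, 1) for _ in range(D_VIS)] for _ in range(N_FRAMES)]
weight = [[random.gauss(0, 1) for _ in range(D_LLM)] for _ in range(D_VIS)]
bias = [0.0] * D_LLM
text_embeddings = [[random.gauss(0, 1) for _ in range(D_LLM)] for _ in range(N_TEXT)]

visual_tokens = project_video_features(video_features, weight, bias)
sequence = build_input_sequence(visual_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))
```

After projection, the visual tokens have the same width as the text embeddings, so the LLM consumes them in one sequence. This is why the question of where temporal evidence lives can be asked about individual token positions and layers.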