TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

arXiv:2511.18359v2 Announce Type: replace

How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper