One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

arXiv:2505.23617v3 Announce Type: replace-cross Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We