4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

arXiv:2512.17012v3

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by