Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

arXiv:2603.06140v1 Announce Type: cross Modern video editing techniques achieve high visual fidelity when inserting objects into video. However, they optimize for visual fidelity rather than physical causality, which leads to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning potential of Multimodal Large Language Models (MLLMs