Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models

arXiv:2603.16978v1 Announce Type: cross Well-designed dense reward functions in robot manipulation not only indicate whether a task is completed but also encode progress along the way. Generally, designing dense rewards is challenging and usually requires access to privileged state information available only in simulation, not in real-world experiments. This makes reward prediction models that infer task state information from camera images attractive. A common approach is to predict rewards from expert demonstrations based on visual similarity or sequential frame ordering.