Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning

arXiv:2508.01184v2 Announce Type: replace A core problem of Embodied AI is to learn object manipulation from observation, as humans do. To achieve this, it is important to localize 3D object affordance areas from observations such as images (3D affordance grounding) and to understand their functionalities (affordance classification). Previous attempts usually tackle these two tasks separately, which leads to inconsistent predictions because the dependency between the tasks is not properly modeled.