SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

arXiv:2605.14704v1 Announce Type: new In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we