Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

arXiv:2602.02977v2 Announce Type: replace-cross Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose a hierarchical vision-language learning principle for understanding scenes as part-to-whole compositions: before forming a whole-scene representation, a model should uncover what semantic parts appear where in the image.