EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

arXiv:2605.19130v1 Announce Type: cross Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests that current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams. No fixed evaluation pipeline exists for measuring progress in this regime.