SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

arXiv:2603.11563v1 Announce Type: new Embodied task planning requires vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing