Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

arXiv:2511.21998v2 Announce Type: replace Multi-modal Large Language Models (LLMs) have advanced conversational abilities but struggle to provide live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution and identifying and alerting users to mistakes, all of which must happen in real time.