OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies

arXiv:2603.10052v1 Announce Type: cross Vision-language-action (VLA) models have shown great promise as generalist policies for a wide range of relatively simple tasks. However, they demonstrate limited performance on complex tasks, such as those requiring complex spatial or semantic understanding, manipulation in clutter, or precise manipulation. We propose OMNIGUIDE, a flexible framework that improves VLA performance on such tasks by leveraging arbitrary sources of guidance, such as 3D foundation models, semantic-reasoning VLMs, and human pose models.