RECIPE: Procedural Planning via Grounding in Instructional Video

arXiv:2605.19976v1

Visual planning asks a model to generate the remaining steps of a procedure in natural language, given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist.