Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

arXiv:2503.13821v3 Announce Type: replace When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption or an action description) and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration.