OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

arXiv:2604.11804v1 Announce Type: new In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions.