Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition

arXiv:2604.17090v1 Announce Type: new Human action recognition and motion generation are two active research problems in human-centric computer vision, and both aim to align motion with textual semantics. However, most existing work studies the two problems separately and overlooks the link between them: motion generation requires semantic comprehension. This work investigates unified action recognition and motion generation, using skeleton coordinates for both motion understanding and generation.