Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

arXiv:2603.09930v1

Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses each motion and each text into a single global embedding, which discards fine-grained local correspondences and thus reduces retrieval accuracy. These global-embedding methods also offer limited interpretability of the retrieval results.
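
To make the limitation of global embeddings concrete, the following is a minimal sketch and not the paper's implementation. It assumes mean pooling for the global embedding and a ColBERT-style MaxSim rule as one common form of late interaction; the function names (`global_score`, `late_interaction_score`) and the 2-D toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(vectors):
    """Collapse a list of local vectors into one global embedding (assumed pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def global_score(text_tokens, motion_patches):
    """Dual-encoder style score: compare one pooled vector per modality."""
    return cosine(mean_pool(text_tokens), mean_pool(motion_patches))


def late_interaction_score(text_tokens, motion_patches):
    """MaxSim-style score (assumption): each token keeps its best-matching patch."""
    best_per_token = [
        max(cosine(token, patch) for patch in motion_patches)
        for token in text_tokens
    ]
    return sum(best_per_token) / len(best_per_token)


# Hypothetical toy data: two text tokens describing two distinct actions.
text = [[1.0, 0.0], [0.0, 1.0]]
# Motion A contains one patch matching each token.
motion_a = [[1.0, 0.0], [0.0, 1.0]]
# Motion B contains only blended patches that match neither token exactly.
motion_b = [[1.0, 1.0], [1.0, 1.0]]

scores = {
    "global": (global_score(text, motion_a), global_score(text, motion_b)),
    "late": (
        late_interaction_score(text, motion_a),
        late_interaction_score(text, motion_b),
    ),
}
```

In this toy setup the pooled embeddings of the two motions point in the same direction, so the global score cannot separate them, while the token-level score ranks motion A above motion B. The per-token maxima also indicate which patch each token matched, which is the kind of local evidence a single global vector does not expose.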