Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

arXiv:2603.05963v1 Announce Type: cross Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without