Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation

arXiv:2603.17396v1 Announce Type: new Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work, we focus on a scenario where a discrete set of gesture labels is available, and we show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pre