Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

arXiv:2603.05929v1 Announce Type: new

Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus.