Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

ArXi:2604.06783v1 Announce Type: new Recently, Transformer has made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies.