HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task

arXiv:2406.01914v3 Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up human head images as inputs and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms.
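
To make the yaw, pitch, and roll output concrete, the following is a minimal sketch of how three such angles can be turned into a 3D rotation matrix. It is an illustration only and is not taken from the paper: the function name `euler_to_rotation_matrix` is hypothetical, and the axis convention (yaw about z, pitch about y, roll about x, composed as R = Rz(yaw) · Ry(pitch) · Rx(roll)) is an assumption. Head pose datasets use differing conventions, so the axis order may need to change for a given benchmark.

```python
import math


def euler_to_rotation_matrix(yaw_deg, pitch_deg, roll_deg):
    """Build a 3x3 rotation matrix from Euler angles given in degrees.

    Assumed convention (not from the paper): R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    """
    yaw, pitch, roll = (math.radians(a) for a in (yaw_deg, pitch_deg, roll_deg))
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)

    # Closed-form product of the three elementary rotations.
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


if __name__ == "__main__":
    # Example angles chosen arbitrarily for illustration.
    for row in euler_to_rotation_matrix(30.0, -20.0, 10.0):
        print(["%.4f" % v for v in row])
```

A matrix form like this is one common way to compare predicted and reference orientations, since it avoids the wrap-around ambiguity of comparing raw angles directly.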