FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

arXiv:2503.09158v5 Announce Type: replace Existing video large language models (VLLMs) primarily rely on prompt-agnostic visual encoders, which extract untargeted facial representations without awareness of the queried information, leading to the loss of task-critical cues. To address this challenge, we propose FaVChat, the first VLLM designed for reasoning over subtle visual and dynamic facial cues. FaVChat