Unleashing Vision-Language Semantics for Deepfake Video Detection

arXiv:2603.24454v1 Announce Type: new Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches leverage only visual features, overlooking the most distinctive strength of VLMs: the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance the model's discriminability in deepfake detection.