X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

arXiv:2603.08483v1 Announce Type: cross The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection.