CAST: Cross-Attentive Spatio-Temporal feature fusion for deepfake detection

arXiv:2506.21711v2 Announce Type: replace

Deepfakes have emerged as a significant threat to digital media authenticity, increasing the need for advanced detection techniques that can identify subtle, time-dependent manipulations. CNNs are effective at capturing spatial artifacts, and Transformers excel at modeling temporal inconsistencies. However, many existing CNN-Transformer models process spatial and temporal features independently.