Towards multi-modal forgery representation learning for AI-generated video detection and localization

arXiv:2605.07232v1 Announce Type: new Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including clips partially manipulated in the visual or audio channel, pose escalating risks of semantic distortion and misuse, motivating the need for reliable detection tools. Most existing AI-generated video detectors remain limited by single- or partial-modality modeling and by the lack of fine-grained temporal forgery localization. To address these challenges, our primary novelty.