Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

arXiv:2603.26052v1 Announce Type: cross As multimodal misinformation becomes more sophisticated, detecting and grounding it is crucial. However, current multimodal verification methods rely on passive holistic fusion and struggle with such content. Because of 'feature dilution,' global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We