Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition

arXiv:2604.17062v1 Announce Type: new Zero-shot action recognition is challenging because of the semantic gap between seen and unseen classes. We present a framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive features from global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine the motion representation without re-coupling redundant information.
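
To make the gated cross-attention idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify the MAB's internals, so everything here is an assumption: motion tokens act as queries, global-static tokens act as keys and values, a sigmoid gate computed from the motion tokens scales the attended context, and the result is added residually. The function name `gated_cross_attention` and the weight matrices `w_q`, `w_k`, `w_v`, `w_g` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_cross_attention(motion, static, w_q, w_k, w_v, w_g):
    """Refine motion tokens with context attended from static tokens.

    Assumed design (not taken from the paper):
      motion: T x d motion-sensitive tokens, used as queries
      static: S x d global-static tokens, used as keys and values
      w_q, w_k, w_v, w_g: d x d projection matrices
    Returns a T x d list of refined motion tokens.
    """
    q = matmul(motion, w_q)
    k = matmul(static, w_k)
    v = matmul(static, w_v)

    # Scaled dot-product attention: each motion token attends over static tokens.
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    attn = [softmax([s / scale for s in row]) for row in scores]
    context = matmul(attn, v)

    # Element-wise gate in (0, 1), computed from the motion tokens only,
    # limits how much static context flows back into the motion stream.
    gate_logits = matmul(motion, w_g)
    refined = []
    for m_row, c_row, g_row in zip(motion, context, gate_logits):
        refined.append([m + sigmoid(g) * c for m, c, g in zip(m_row, c_row, g_row)])
    return refined
```

In this sketch, a gate close to zero leaves a motion token essentially unchanged, which is one plausible reading of refining motion "without re-coupling redundant information"; the actual gating mechanism in the paper may differ.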