Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

arXiv:2604.09164v1 Announce Type: new

Temporal human action detection aims to identify and localize action segments within untrimmed videos, and it is a pivotal task in video understanding. Despite the progress of prior architectures such as CNNs and Transformers, these models still suffer from feature redundancy and degraded global dependency modeling when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis.