HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling

arXiv:2510.23043v2 Announce Type: replace

Video temporal grounding, the task of localizing the start and end times of the moment described by a natural language query in an untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales.