$\text{PKS}^4$:Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding

ArXi:2604.26461v1 Announce Type: new Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation.