InstrAct: Towards Action-Centric Understanding in Instructional Videos

arXiv:2604.08762v1 Announce Type: cross Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pre