LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

arXiv:2605.06809v1 Announce Type: cross Transformers dominate video recognition. They split videos into tokens, and processing those tokens incurs an expensive, superlinear computational cost. Yet videos are filled with redundancy, which calls the need for this expense into question. We