DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

arXiv:2605.11683v1

Vision Transformers (ViTs) incur significant computational overhead because the cost of self-attention grows quadratically with the token sequence length. Existing token reduction methods mitigate this issue, but they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, and therefore cannot adapt to input-dependent redundancy during inference.
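To make the quadratic-cost argument concrete, the following is a minimal back-of-the-envelope sketch, not taken from the paper. The function name `attention_matmul_macs` and the example sizes (196 tokens, embedding width 768) are illustrative assumptions; the count covers only the two N x N matrix products in self-attention and ignores projections, softmax, and the MLP.

```python
def attention_matmul_macs(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates for the two N x N products in one
    self-attention layer: Q @ K^T and softmax(QK^T) @ V.

    Hypothetical helper for illustration only; linear projections,
    softmax, and the MLP block are deliberately left out.
    """
    qk = num_tokens * num_tokens * dim   # Q @ K^T
    av = num_tokens * num_tokens * dim   # attention weights @ V
    return qk + av


if __name__ == "__main__":
    dim = 768            # assumed embedding width
    full = 196           # assumed token count (e.g. a 14 x 14 patch grid)
    halved = full // 2   # token count after a fixed 50% reduction

    full_cost = attention_matmul_macs(full, dim)
    halved_cost = attention_matmul_macs(halved, dim)

    print(f"tokens={full}: {full_cost} MACs")
    print(f"tokens={halved}: {halved_cost} MACs")
    print(f"cost ratio: {full_cost / halved_cost:.1f}x")
```

Under these assumptions, halving the token count cuts this part of the attention cost by a factor of four, which is why token reduction is attractive. A fixed ratio like the one used here applies the same reduction to every input, regardless of how much redundancy a given image actually contains.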