See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

arXiv:2605.11817v1 Announce Type: cross Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods face a fundamental trade-off: aggressive pruning inevitably discards critical geometric details such as contact points, leading to severe performance degradation. This forces a compromise that limits the achievable compression rate and thus the potential speedup.