SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

arXiv:2605.14110v1 Announce Type: new Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We