A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features

arXiv:2510.00978v2 Announce Type: replace

Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences for the practicality of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case and several minutes in the best. This work raises the question of whether we can achieve competitive accuracy much faster. We.