What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

arXiv:2605.12021v1 Announce Type: new Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where.