Token Warping Helps MLLMs Look from Nearby Viewpoints

arXiv:2604.02870v1 Announce Type: new Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often