Why MLLMs Struggle to Determine Object Orientations

ArXi:2604.13321v1 Announce Type: new Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong and Nichols hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations.