Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

arXiv:2603.06697v1

Vision-language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by