OneThinker: All-in-one Reasoning Model for Image and Video
arXiv CS.CV
arXiv:2512.03043v3

Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This limits scalability toward a multimodal reasoning generalist, restricting practical versatility and hindering potential knowledge sharing across tasks and modalities.