Cosmos-H-Surgical: Learning Surgical Robot Policies from Videos via World Modeling

arXiv:2512.23162v4 Announce Type: replace-cross Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large-scale vision-language-action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video-action data from diverse domains, surgical robotics suffers from a paucity of datasets that include both visual observations and accurate robot kinematics.