Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?

arXiv:2604.18134v1 Announce Type: new Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we