AI RESEARCH
PVI: Plug-in Visual Injection for Vision-Language-Action Models
arXiv CS.LG
arXiv:2603.12772v1 Announce Type: cross
Vision-language-action (VLA) architectures that pair a pretrained vision-language model (VLM) with a flow-matching action expert have emerged as a strong paradigm for language-conditioned manipulation. Yet the VLM, which is optimized for semantic abstraction and typically conditioned on static visual observations, tends to attenuate fine-grained geometric cues and often provides the action expert with no explicit temporal evidence.