StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

arXiv:2603.28565v1 Announce Type: cross Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, their high computational cost poses significant efficiency challenges, particularly on resource-constrained edge platforms in real-world deployments. Moreover, because the stages of a VLA pipeline (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency.