CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

arXiv:2604.24622v1 Announce Type: new Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling.