Decoupled Action Expert: Confining Task Knowledge to the Conditioning Pathway

arXiv:2511.12101v2

Many recent Vision-Language-Action models employ diffusion or flow-matching backbones with hundreds of millions of parameters for action generation. However, unlike image synthesis, where the output spans millions of diverse pixels, a manipulation policy generates only short sequences of low-dimensional, physically correlated action values, a far simpler target that should not demand such capacity.