HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation

arXiv:2604.18791v1 Announce Type: cross

Vision-Language-Action (VLA) models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance. We show that this failure is not resolved by extending context length alone in the current reactive execution setting; instead, it stems from three recurring execution-loop deficiencies: the memory gap, the verification gap, and the recovery gap.