OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models

arXiv:2605.11678v1 Announce Type: new End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network. They achieve strong driving performance but require 20-60 GB of GPU memory, far exceeding the 12-16 GB available on commodity GPUs. We present a framework that enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification.