Falcon Perception

arXiv:2603.27365v1 Announce Type: new Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack do both perception and task modeling at scale? We [...] Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and [...]