A Human-Inspired Decoupled Architecture for Efficient Audio Representation Learning

arXiv:2603.26098v1 Announce Type: cross

While self-supervised learning (SSL) has revolutionized audio representation learning, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource-constrained devices. To address this bottleneck, we propose HEAR (Human-inspired Efficient Audio Representation), a novel decoupled architecture.