Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers

arXiv:2603.08343v1

The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform followed by a lightweight learnable affine rescaling, eliminating approximately 25% of attention parameters per block while preserving global cross-head interaction through an orthogonal, norm-preserving transformation.
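
As a rough illustration of the idea, the NumPy sketch below replaces the output projection with an orthonormal fast Walsh-Hadamard transform over the concatenated head outputs, followed by a per-channel scale and shift. It is not the authors' implementation. Several details are assumptions, since the abstract does not specify them: the function names `fwht` and `hadamard_output_projection`, the per-channel form of the affine rescaling, the Sylvester (natural) ordering of the transform, and a model dimension that is a power of two. The parameter count at the end assumes four equally sized d x d projections (query, key, value, output) with biases ignored, which is one way to arrive at the roughly 25% figure.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    Uses the Sylvester (natural) ordering; the last dimension must be a
    power of two. Runs in O(d log d) instead of the O(d^2) of a dense matmul.
    """
    d = x.shape[-1]
    if d < 1 or d & (d - 1):
        raise ValueError("last dimension must be a power of two")
    y = np.array(x, dtype=np.float64)  # work on a copy
    lead = y.shape[:-1]
    h = 1
    while h < d:
        # Split each block of 2*h entries into two halves and butterfly them.
        y = y.reshape(*lead, d // (2 * h), 2, h)
        a = y[..., 0, :]
        b = y[..., 1, :]
        y = np.stack((a + b, a - b), axis=-2).reshape(*lead, d)
        h *= 2
    return y / np.sqrt(d)  # 1/sqrt(d) makes the transform orthonormal


def hadamard_output_projection(heads, scale, shift):
    """Mix head outputs with a fixed WHT, then apply a learnable affine map.

    heads: array of shape (seq_len, n_heads, head_dim)
    scale, shift: arrays of shape (n_heads * head_dim,); assumed here to be
    the only learnable parameters of this layer.
    """
    seq_len, n_heads, head_dim = heads.shape
    concat = heads.reshape(seq_len, n_heads * head_dim)
    return fwht(concat) * scale + shift


# Hypothetical sizes chosen only for illustration.
d_model, n_heads = 512, 8
rng = np.random.default_rng(0)
heads = rng.standard_normal((5, n_heads, d_model // n_heads))
scale = np.ones(d_model)   # identity rescaling at initialisation (assumption)
shift = np.zeros(d_model)
out = hadamard_output_projection(heads, scale, shift)

# Attention weights per block, biases ignored (assumption).
dense_attn = 4 * d_model * d_model                    # Q, K, V, O
hadamard_attn = 3 * d_model * d_model + 2 * d_model   # Q, K, V + scale/shift
saving = 1 - hadamard_attn / dense_attn
```

With the identity rescaling used here, each output row has the same Euclidean norm as the corresponding concatenated input row, and every output channel depends on every head, which is the sense in which a transform of this kind is norm-preserving while still mixing across heads.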