Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

arXiv:2605.07588v1 Announce Type: cross

Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We