AI RESEARCH

Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

arXiv CS.LG

arXiv:2604.02292v1 Announce Type: new Softmax can become a computational bottleneck in the Transformer model's Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. We therefore propose Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate for the exponential softmax function that applies a clipped linear mapping to the max-centered attention logits.
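To make the idea of a clipped linear mapping over max-centered logits concrete, here is a minimal Python sketch. It is not the paper's formulation: the abstract does not give the exact mapping, so the function names, the `slope` parameter (a stand-in for whatever per-head calibration HCCS uses), the clipping range [0, 1], and the final sum normalization are all assumptions made for illustration.

```python
def clipped_linear_scores(logits, slope=0.25):
    """Hypothetical clipped-linear stand-in for exp(x - max(x)).

    `slope` is an assumed per-head parameter; the abstract does not
    specify how HCCS calibrates each head.
    """
    m = max(logits)
    # Max-centering: every shifted logit is <= 0 and the largest is exactly 0,
    # so the linear map 1 + slope * (x - m) never exceeds 1.
    # Clipping to [0, 1] keeps the score bounded and monotone in x.
    return [min(1.0, max(0.0, 1.0 + slope * (x - m))) for x in logits]


def clipped_linear_softmax(logits, slope=0.25):
    """Normalize the clipped scores so they sum to 1 (assumed step)."""
    scores = clipped_linear_scores(logits, slope)
    total = sum(scores)  # always >= 1, because the max logit scores 1.0
    return [s / total for s in scores]


weights = clipped_linear_softmax([2.0, 1.0, 0.0, -6.0])
```

In this sketch, any logit more than `1 / slope` below the maximum receives a weight of exactly zero, whereas the exponential softmax would assign it a small positive weight. No exponentiation is needed, which is the property that makes this kind of surrogate attractive for integer-native hardware.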