BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

arXiv:2604.03957v1 Announce Type: new Ultra-low-bit quantization brings substantial efficiency gains to Transformer-based models, but accuracy degradation and limited GPU support hinder its wide adoption. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For