Differential Transformer V2

Hugging Face Blog

Tianzhu Ye, Li Dong, Yutao Sun, Furu Wei