Discrete Cosine Transform Based Decorrelated Attention for Vision Transformers

arXiv:2405.13901v4

Self-attention is central to the success of Transformer architectures; however, learning the query, key, and value projections from random initialization remains challenging and computationally expensive. In this paper, we propose two complementary methods that leverage the Discrete Cosine Transform (DCT) to enhance the efficiency and performance of Vision Transformers. First, we address the initialization problem by