BERT vs. Transformers: A Complete Architectural and Mathematical Dissection

When I first read about BERT (Bidirectional Encoder Representations from Transformers), the question that nagged at me most was one of definition. We are told BERT is “bidirectional.” But wait - BERT is built on the Transformer encoder. In the standard Transformer architecture, the encoder’s self-attention mechanism already attends to every token simultaneously (all-to-all attention