Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

arXiv:2505.22842v4

Transformer-based language models rely on positional encoding (PE) to handle token order and to extrapolate to longer context lengths. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model.
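
To make the idea of "positional encoding as a prior" concrete, here is a minimal NumPy sketch of one possible reading. The abstract does not give BAM's actual formulation, so everything here is an assumption made for illustration: the attention weights are treated as a posterior proportional to a content term times a positional prior over key positions, which in log space means adding the log-prior to the attention logits. The function names (`attention_with_positional_prior`, `laplace_log_prior`, `uniform_log_prior`) and the `scale` parameter are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def uniform_log_prior(n):
    # Hypothetical helper: a flat prior adds no positional preference.
    return np.zeros((n, n))

def laplace_log_prior(n, scale):
    # Hypothetical helper: log of an unnormalized Laplace-shaped prior
    # that decays with the distance |i - j| between query i and key j.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return -np.abs(i - j) / scale

def attention_with_positional_prior(q, k, log_prior):
    # Assumed reading: posterior over keys is proportional to
    # exp(content score) * prior(position), i.e. logits + log_prior.
    n, d = q.shape
    content_logits = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))  # keys j <= query i
    logits = np.where(causal, content_logits + log_prior, -np.inf)
    return softmax(logits, axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
weights = attention_with_positional_prior(q, k, laplace_log_prior(4, scale=2.0))
```

Under this reading, a flat prior leaves plain causal softmax attention unchanged, while a distance-decaying prior shifts weight toward nearby tokens. Whether this matches the paper's derivation should be checked against the full text.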