Understanding Self-Attention: A Mathematical Deep Dive
How transformers learn context through the lens of linear algebra and attention mechanisms.
The Core of the Transformer
The Transformer architecture, introduced in "Attention Is All You Need," revolutionized NLP by allowing models to process sequences in parallel while maintaining long-range dependencies. At its heart lies the Self-Attention mechanism.
"Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence."
The Mathematical Foundation
In self-attention, we project each input vector into three distinct spaces: Query (Q), Key (K), and Value (V). This is done using learned weight matrices $W^Q$, $W^K$, and $W^V$.
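As a minimal sketch of these projections (NumPy stand-ins for the learned weights; the sizes 512 and 64 follow the base model in the paper but are otherwise assumptions):

```python
import numpy as np

d_model, d_k = 512, 64            # assumed embedding and projection sizes

# Random stand-ins for the learned weight matrices W^Q, W^K, W^V
W_q = np.random.randn(d_model, d_k) * 0.02
W_k = np.random.randn(d_model, d_k) * 0.02
W_v = np.random.randn(d_model, d_k) * 0.02

x = np.random.randn(10, d_model)  # a sequence of 10 token embeddings

Q = x @ W_q   # queries, shape (10, d_k)
K = x @ W_k   # keys,    shape (10, d_k)
V = x @ W_v   # values,  shape (10, d_k)
```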
The Scaled Dot-Product Attention
The attention weights are computed from a scaled dot product between the Query and Key matrices, followed by a softmax:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
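A minimal NumPy sketch of this formula (the function name and the row-wise softmax layout are our choices here, not prescribed by the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape (seq_len, d_k)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy usage with random inputs
Q = np.random.randn(10, 64)
K = np.random.randn(10, 64)
V = np.random.randn(10, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (10, 64)
```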
Step-by-Step Breakdown
1. The Similarity Score
We calculate how much focus to put on each part of the input sequence by taking the dot product of the Query vector with the Key vectors of every word in the sequence, including the word itself.
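For a single position, this step amounts to one dot product per key; a toy sketch with made-up dimensions:

```python
import numpy as np

d_k = 64
q = np.random.randn(d_k)         # Query vector for the current position
K = np.random.randn(10, d_k)     # Key vectors for all 10 positions

raw_scores = K @ q               # one similarity score per position, shape (10,)
```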
2. Scaling
We scale the scores by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys. For large $d_k$ the raw dot products grow large in magnitude, pushing the softmax into regions where its gradients become vanishingly small; dividing by $\sqrt{d_k}$ counteracts this.
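A quick toy experiment (random vectors, so exact numbers vary per run) shows why: without scaling, the dot products of high-dimensional vectors are so large that the softmax saturates, putting nearly all of its weight on a single position.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 512                          # deliberately large key dimension
q = np.random.randn(d_k)
K = np.random.randn(10, d_k)
scores = K @ q                     # raw scores have a std of roughly sqrt(d_k)

print(softmax(scores).max())                  # typically close to 1.0 (saturated)
print(softmax(scores / np.sqrt(d_k)).max())   # a much flatter distribution
```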
3. Softmax
The softmax ensures the scores are positive and sum to 1, effectively turning them into probabilities or "attention weights."
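Putting the scaling and the softmax together for a single position, and finishing with the multiplication by V from the formula above (toy values):

```python
import numpy as np

d_k = 64
raw_scores = np.random.randn(10) * np.sqrt(d_k)  # stand-in for one row of Q K^T
V = np.random.randn(10, d_k)                     # Value vectors for all positions

scaled = raw_scores / np.sqrt(d_k)               # step 2: scaling
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()                         # step 3: softmax; weights are positive and sum to 1

output = weights @ V                             # weighted sum of values (the final V in the formula)
```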
Why "Self" Attention?
The word "self" refers to the fact that the mechanism examines the same input sequence that it is processing. It helps the model understand that in the sentence "The animal didn't cross the street because it was too tired", the word "it" refers to the "animal".
References & Further Reading
- Vaswani, A., et al. (2017). Attention Is All You Need.
- The Illustrated Transformer by Jay Alammar.