Understanding Self-Attention: A Mathematical Deep Dive
How transformers learn context through the lens of linear algebra and attention mechanisms.
The Core of the Transformer
The Transformer architecture, introduced in "Attention Is All You Need," revolutionized NLP by allowing models to process sequences in parallel while maintaining long-range dependencies. At its heart lies the Self-Attention mechanism.
"Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence."
The Mathematical Foundation
In self-attention, we project each input vector into three distinct spaces: Query (Q), Key (K), and Value (V). This is done using learned weight matrices $W^Q$, $W^K$, and $W^V$.
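As a minimal sketch of these projections (NumPy stand-ins for the learned weights; the sizes 512 and 64 follow the base model in the paper but are otherwise assumptions):

```python
import numpy as np

d_model, d_k = 512, 64            # assumed embedding and projection sizes

# Random stand-ins for the learned weight matrices W^Q, W^K, W^V
W_q = np.random.randn(d_model, d_k) * 0.02
W_k = np.random.randn(d_model, d_k) * 0.02
W_v = np.random.randn(d_model, d_k) * 0.02

x = np.random.randn(10, d_model)  # a sequence of 10 token embeddings

Q = x @ W_q   # queries, shape (10, d_k)
K = x @ W_k   # keys,    shape (10, d_k)
V = x @ W_v   # values,  shape (10, d_k)
```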
The Scaled Dot-Product Attention
The attention weights are computed from a scaled dot product between the Query and Key matrices, followed by a softmax:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
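A minimal NumPy sketch of this formula (the function name and the row-wise softmax layout are our choices here, not prescribed by the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape (seq_len, d_k)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy usage with random inputs
Q = np.random.randn(10, 64)
K = np.random.randn(10, 64)
V = np.random.randn(10, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (10, 64)
```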
Step-by-Step Breakdown
1. The Similarity Score
We calculate how much focus to put on each part of the input sequence by taking the dot product of the Query vector with the Key vectors of every word in the sequence, including the word itself.
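For a single position, this step amounts to one dot product per key; a toy sketch with made-up dimensions:

```python
import numpy as np

d_k = 64
q = np.random.randn(d_k)         # Query vector for the current position
K = np.random.randn(10, d_k)     # Key vectors for all 10 positions

raw_scores = K @ q               # one similarity score per position, shape (10,)
```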
2. Scaling
We scale the scores by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys. For large $d_k$ the raw dot products grow large in magnitude, pushing the softmax into regions where its gradients become vanishingly small; dividing by $\sqrt{d_k}$ counteracts this.
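A quick toy experiment (random vectors, so exact numbers vary per run) shows why: without scaling, the dot products of high-dimensional vectors are so large that the softmax saturates, putting nearly all of its weight on a single position.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 512                          # deliberately large key dimension
q = np.random.randn(d_k)
K = np.random.randn(10, d_k)
scores = K @ q                     # raw scores have a std of roughly sqrt(d_k)

print(softmax(scores).max())                  # typically close to 1.0 (saturated)
print(softmax(scores / np.sqrt(d_k)).max())   # a much flatter distribution
```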
3. Softmax
The softmax ensures the scores are positive and sum to 1, effectively turning them into probabilities or "attention weights."
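Putting the scaling and the softmax together for a single position, and finishing with the multiplication by V from the formula above (toy values):

```python
import numpy as np

d_k = 64
raw_scores = np.random.randn(10) * np.sqrt(d_k)  # stand-in for one row of Q K^T
V = np.random.randn(10, d_k)                     # Value vectors for all positions

scaled = raw_scores / np.sqrt(d_k)               # step 2: scaling
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()                         # step 3: softmax; weights are positive and sum to 1

output = weights @ V                             # weighted sum of values (the final V in the formula)
```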
Why "Self" Attention?
The word "self" refers to the fact that the mechanism examines the same input sequence that it is processing. It helps the model understand that in the sentence "The animal didn't cross the street because it was too tired", the word "it" refers to the "animal".
References & Further Reading
- Vaswani, A., et al. (2017). Attention Is All You Need.
- The Illustrated Transformer by Jay Alammar.