Dear Student, you're about to learn how attention works in transformers by calculating everything step by step. Follow along, try the interactive exercises, and see how "apple" gets its meaning from context!
"I bought apple to eat"
STEP 1: Understanding Your Data
Your task: We'll track how the word "apple" (highlighted in red) pays attention to other words to understand if it's a fruit or a company.
Word Embeddings (4-dimensional vectors)
Think of these as coordinates in 4D space that represent each word's meaning.
Position | Word   | Embedding Vector [e₁, e₂, e₃, e₄]
1        | I      | [1.0, 0.2, 0.5, 0.3]
2        | bought | [0.8, 1.0, 0.3, 0.6]
3        | apple  | [0.6, 0.4, 1.0, 0.2]  ← YOUR FOCUS WORD
4        | to     | [0.3, 0.7, 0.2, 0.8]
5        | eat    | [0.9, 0.6, 0.4, 1.0]
Weight Matrices (4×4 each)
These matrices transform word embeddings to create Query, Key, and Value vectors. Think of them as different "lenses" to view word relationships.
WQ (Query Matrix)
Transforms words into "search queries"
[0.5  0.3  0.1  0.2]
[0.2  0.8  0.4  0.1]
[0.7  0.1  0.6  0.3]
[0.4  0.6  0.2  0.8]
WK (Key Matrix)
Transforms words into "information advertisements"
[0.6  0.4  0.2  0.3]
[0.3  0.9  0.1  0.5]
[0.8  0.2  0.7  0.4]
[0.1  0.7  0.3  0.9]
WV (Value Matrix)
Transforms words into "information content"
[0.9  0.1  0.3  0.4]
[0.4  0.6  0.2  0.7]
[0.2  0.8  0.5  0.1]
[0.7  0.3  0.6  0.8]
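The three projections can be sketched in a few lines of NumPy (the library choice is an assumption of this sketch, not part of the lesson). Note that the interactive demo rounds its displayed values, so recomputing directly from these matrices may not match every table entry below exactly:

```python
import numpy as np

# Embeddings from the table above, one row per word (I, bought, apple, to, eat)
E = np.array([
    [1.0, 0.2, 0.5, 0.3],  # I
    [0.8, 1.0, 0.3, 0.6],  # bought
    [0.6, 0.4, 1.0, 0.2],  # apple
    [0.3, 0.7, 0.2, 0.8],  # to
    [0.9, 0.6, 0.4, 1.0],  # eat
])

W_Q = np.array([[0.5, 0.3, 0.1, 0.2],
                [0.2, 0.8, 0.4, 0.1],
                [0.7, 0.1, 0.6, 0.3],
                [0.4, 0.6, 0.2, 0.8]])
W_K = np.array([[0.6, 0.4, 0.2, 0.3],
                [0.3, 0.9, 0.1, 0.5],
                [0.8, 0.2, 0.7, 0.4],
                [0.1, 0.7, 0.3, 0.9]])
W_V = np.array([[0.9, 0.1, 0.3, 0.4],
                [0.4, 0.6, 0.2, 0.7],
                [0.2, 0.8, 0.5, 0.1],
                [0.7, 0.3, 0.6, 0.8]])

# One matrix multiply per "lens": each word's query, key, and value vector
Q = E @ W_Q  # 5x4: "what am I looking for?"
K = E @ W_K  # 5x4: "what do I advertise?"
V = E @ W_V  # 5x4: "what information do I carry?"
print(Q.shape, K.shape, V.shape)  # (5, 4) (5, 4) (5, 4)
```

Each row of Q, K, and V corresponds to one word of the sentence, in order.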
STEP 2: Matrix Multiplication Rules
How to multiply a 4D vector by a 4×4 matrix:
If embedding = [e₁, e₂, e₃, e₄] and Weight is a 4×4 matrix, then component j of the result is the dot product of the embedding with column j of the matrix: result_j = e₁·W₁ⱼ + e₂·W₂ⱼ + e₃·W₃ⱼ + e₄·W₄ⱼ.
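As a concrete instance of the rule, here is a pure-Python sketch that computes the query vector for "I" from its embedding and WQ (both from Step 1); the per-component formula is spelled out in the comment:

```python
# Dot the embedding with one COLUMN of the weight matrix to get one
# component of the output vector. Embedding of "I" and W_Q from Step 1.
e = [1.0, 0.2, 0.5, 0.3]
W = [[0.5, 0.3, 0.1, 0.2],
     [0.2, 0.8, 0.4, 0.1],
     [0.7, 0.1, 0.6, 0.3],
     [0.4, 0.6, 0.2, 0.8]]

# result_j = e1*W[0][j] + e2*W[1][j] + e3*W[2][j] + e4*W[3][j]
q = [sum(e[i] * W[i][j] for i in range(4)) for j in range(4)]
print(q[0])  # 1.0*0.5 + 0.2*0.2 + 0.5*0.7 + 0.3*0.4 = 1.01
```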
These are the actual "information payloads" each word will contribute to the final meaning.
✓ All V Vectors Complete
Word   | V Vector (4D)                 | Information Content
I      | [1.290, 0.710, 0.820, 0.950] | Agent information
bought | [1.600, 1.100, 0.690, 1.340] | Purchase information
apple  | [1.040, 1.160, 0.880, 0.780] | Object information
to     | [1.150, 0.850, 0.740, 1.060] | Purpose information
eat    | [1.830, 1.070, 0.920, 1.450] | CONSUMPTION information
STEP 6: Calculate Attention Scores
Now the magic happens! We measure how much each word's query matches every other word's key using dot products. This creates a 5×5 matrix of attention scores.
Key Observations:
• Every word attends most strongly to "eat" (see the green highlighted column)
• This makes sense: "eat" is the most informative word for understanding the sentence context
• "eat" also has the highest self-attention (5.679), showing it's very confident in its own meaning
• Apple's second-highest attention is to "bought" (3.916), connecting the purchase action to the object
Quick Check: What do these scores tell us?
Question: Apple's highest attention score is with "eat" (4.153). What does this suggest?
STEP 7: Scale by √d_k
We divide ALL attention scores by √4 = 2 to prevent them from getting too large (which would make softmax too "sharp" and reduce learning).
All raw attention scores are divided by √4 = 2. Notice how all values are now smaller and more manageable. Apple's row is highlighted in red!
Query ↓ \ Key → | I     | bought | apple | to    | eat
I               | 1.438 | 2.162  | 1.544 | 1.373 | 2.301 ⭐
bought          | 1.623 | 2.506  | 1.958 | 1.599 | 2.644 ⭐
apple           | 1.544 | 1.958  | 1.786 | 1.476 | 2.077 ⭐
to              | 1.327 | 2.094  | 1.476 | 1.445 | 2.184 ⭐
eat             | 1.784 | 2.617  | 2.077 | 1.861 | 2.840 ⭐
Practice Scaling
Scale "bought" → "bought": 5.012 ÷ 2 = ?
Apple's Complete Scaled Scores
Apple's raw scores → scaled scores:
Apple → I: 3.088 ÷ 2 = 1.544
Apple → bought: 3.916 ÷ 2 = 1.958
Apple → apple: 3.571 ÷ 2 = 1.786
Apple → to: 2.952 ÷ 2 = 1.476
Apple → eat: 4.153 ÷ 2 = 2.077 ⭐ Still highest!
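The same arithmetic as the list above, as a small Python sketch (the dictionary layout is just for readability):

```python
import math

d_k = 4                 # dimensionality of the key vectors
scale = math.sqrt(d_k)  # sqrt(4) = 2

# Apple's raw attention scores against each word, from Step 6
raw = {"I": 3.088, "bought": 3.916, "apple": 3.571, "to": 2.952, "eat": 4.153}
scaled = {word: s / scale for word, s in raw.items()}

for word, s in scaled.items():
    print(f"Apple -> {word}: {s:.3f}")
# Dividing every score by the same constant preserves the ordering,
# so "eat" is still apple's highest score after scaling.
```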
Why Scale?
• Before scaling: scores ranged from 2.654 to 5.679 (large range)
• After scaling: scores range from 1.327 to 2.840 (smaller, more manageable)
• Benefit: prevents softmax from becoming too "peaky" and allows better gradient flow during training
• Pattern preserved: the relative ordering of attention scores remains the same
STEP 8: Apply Softmax
Softmax converts ALL scaled scores into probabilities that sum to 1.0 for each word. This is like giving each word an "attention budget" to distribute among all other words.
Softmax Steps for Each Row:
1. Calculate e^(each scaled score in the row)
2. Sum all the exponentials in that row
3. Divide each exponential by the row sum
4. Result: Each row sums to exactly 1.0
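The four steps above, run on apple's scaled row from Step 7, look like this as a minimal pure-Python sketch:

```python
import math

# Apple's scaled scores from Step 7, in sentence order
words  = ["I", "bought", "apple", "to", "eat"]
scores = [1.544, 1.958, 1.786, 1.476, 2.077]

exps = [math.exp(s) for s in scores]   # step 1: exponentiate each score
total = sum(exps)                      # step 2: sum the exponentials
weights = [x / total for x in exps]    # step 3: divide each by the sum

print(round(sum(weights), 6))          # step 4: the row sums to 1.0
for w, p in zip(words, weights):
    print(f"apple -> {w}: {p:.1%}")    # "eat" gets the largest share (~26.5%)
```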
All Words' Attention Patterns:
• "I" pays most attention to "eat" (26.1%): the subject focuses on the action
• "bought" pays most attention to "eat" (30.4%): the purchase action connects to consumption
• "apple" pays most attention to "eat" (26.5%): the object understands it's food
• "to" pays most attention to "eat" (30.8%): the preposition links to the purpose
• "eat" pays most attention to itself (28.2%): high self-confidence as the key action
Universal Pattern: Every word in this sentence realizes that "eat" is the most important context!
Attention Budget Verification
Check that "bought" row sums to 1.0:
0.125 + 0.286 + 0.167 + 0.118 + 0.304 = ?