Dear Student, you're about to learn how attention works in transformers by calculating everything step by step. Follow along, try the interactive exercises, and see how "apple" gets its meaning from context!
"I bought apple to eat"
STEP 1: Understanding Your Data
Your task: We'll track how the word "apple" (highlighted in red) pays attention to other words to understand if it's a fruit or a company.
Word Embeddings (4-dimensional vectors)
Think of these as coordinates in 4D space that represent each word's meaning.
Position | Word   | Embedding Vector [e₁, e₂, e₃, e₄]
1        | I      | [1.0, 0.2, 0.5, 0.3]
2        | bought | [0.8, 1.0, 0.3, 0.6]
3        | apple  | [0.6, 0.4, 1.0, 0.2]  ← YOUR FOCUS WORD
4        | to     | [0.3, 0.7, 0.2, 0.8]
5        | eat    | [0.9, 0.6, 0.4, 1.0]
Weight Matrices (4×4 each)
These matrices transform word embeddings to create Query, Key, and Value vectors. Think of them as different "lenses" to view word relationships.
WQ (Query Matrix)
Transforms words into "search queries"
[0.5  0.3  0.1  0.2]
[0.2  0.8  0.4  0.1]
[0.7  0.1  0.6  0.3]
[0.4  0.6  0.2  0.8]
WK (Key Matrix)
Transforms words into "information advertisements"
[0.6  0.4  0.2  0.3]
[0.3  0.9  0.1  0.5]
[0.8  0.2  0.7  0.4]
[0.1  0.7  0.3  0.9]
WV (Value Matrix)
Transforms words into "information content"
[0.9  0.1  0.3  0.4]
[0.4  0.6  0.2  0.7]
[0.2  0.8  0.5  0.1]
[0.7  0.3  0.6  0.8]
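The three projections can be sketched in a few lines of NumPy (the library choice is an assumption of this sketch, not part of the lesson). Note that the interactive demo rounds its displayed values, so recomputing directly from these matrices may not match every table entry below exactly:

```python
import numpy as np

# Embeddings from the table above, one row per word (I, bought, apple, to, eat)
E = np.array([
    [1.0, 0.2, 0.5, 0.3],  # I
    [0.8, 1.0, 0.3, 0.6],  # bought
    [0.6, 0.4, 1.0, 0.2],  # apple
    [0.3, 0.7, 0.2, 0.8],  # to
    [0.9, 0.6, 0.4, 1.0],  # eat
])

W_Q = np.array([[0.5, 0.3, 0.1, 0.2],
                [0.2, 0.8, 0.4, 0.1],
                [0.7, 0.1, 0.6, 0.3],
                [0.4, 0.6, 0.2, 0.8]])
W_K = np.array([[0.6, 0.4, 0.2, 0.3],
                [0.3, 0.9, 0.1, 0.5],
                [0.8, 0.2, 0.7, 0.4],
                [0.1, 0.7, 0.3, 0.9]])
W_V = np.array([[0.9, 0.1, 0.3, 0.4],
                [0.4, 0.6, 0.2, 0.7],
                [0.2, 0.8, 0.5, 0.1],
                [0.7, 0.3, 0.6, 0.8]])

# One matrix multiply per "lens": each word's query, key, and value vector
Q = E @ W_Q  # 5x4: "what am I looking for?"
K = E @ W_K  # 5x4: "what do I advertise?"
V = E @ W_V  # 5x4: "what information do I carry?"
print(Q.shape, K.shape, V.shape)  # (5, 4) (5, 4) (5, 4)
```

Each row of Q, K, and V corresponds to one word of the sentence, in order.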
STEP 2: Matrix Multiplication Rules
How to multiply a 4D vector by a 4×4 matrix:
If embedding = [e₁, e₂, e₃, e₄] and Weight is a 4×4 matrix, then component j of the result is the dot product of the embedding with column j of the matrix: result_j = e₁·W₁ⱼ + e₂·W₂ⱼ + e₃·W₃ⱼ + e₄·W₄ⱼ.
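As a concrete instance of the rule, here is a pure-Python sketch that computes the query vector for "I" from its embedding and WQ (both from Step 1); the per-component formula is spelled out in the comment:

```python
# Dot the embedding with one COLUMN of the weight matrix to get one
# component of the output vector. Embedding of "I" and W_Q from Step 1.
e = [1.0, 0.2, 0.5, 0.3]
W = [[0.5, 0.3, 0.1, 0.2],
     [0.2, 0.8, 0.4, 0.1],
     [0.7, 0.1, 0.6, 0.3],
     [0.4, 0.6, 0.2, 0.8]]

# result_j = e1*W[0][j] + e2*W[1][j] + e3*W[2][j] + e4*W[3][j]
q = [sum(e[i] * W[i][j] for i in range(4)) for j in range(4)]
print(q[0])  # 1.0*0.5 + 0.2*0.2 + 0.5*0.7 + 0.3*0.4 = 1.01
```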
These are the actual "information payloads" each word will contribute to the final meaning.
✓ All V Vectors Complete
Word   | V Vector (4D)                 | Information Content
I      | [1.290, 0.710, 0.820, 0.950] | Agent information
bought | [1.600, 1.100, 0.690, 1.340] | Purchase information
apple  | [1.040, 1.160, 0.880, 0.780] | Object information
to     | [1.150, 0.850, 0.740, 1.060] | Purpose information
eat    | [1.830, 1.070, 0.920, 1.450] | CONSUMPTION information
STEP 6: Calculate Attention Scores
Now the magic happens! We measure how much each word's query matches every other word's key using dot products. This creates a 5×5 matrix of attention scores.
Key Observations:
• Every word attends most strongly to "eat" (see the green highlighted column)
• This makes sense: "eat" is the most informative word for understanding the sentence context
• "eat" also has the highest self-attention (5.679), showing it's very confident in its own meaning
• Apple's second-highest attention is to "bought" (3.916), connecting the purchase action to the object
Quick Check: What do these scores tell us?
Question: Apple's highest attention score is with "eat" (4.153). What does this suggest?
STEP 7: Scale by √d_k
We divide ALL attention scores by √4 = 2 to prevent them from getting too large (which would make softmax too "sharp" and reduce learning).
All raw attention scores are divided by √4 = 2. Notice how all values are now smaller and more manageable. Apple's row is highlighted in red!
Query ↓ \ Key → | I     | bought | apple | to    | eat
I               | 1.438 | 2.162  | 1.544 | 1.373 | 2.301 ⭐
bought          | 1.623 | 2.506  | 1.958 | 1.599 | 2.644 ⭐
apple           | 1.544 | 1.958  | 1.786 | 1.476 | 2.077 ⭐
to              | 1.327 | 2.094  | 1.476 | 1.445 | 2.184 ⭐
eat             | 1.784 | 2.617  | 2.077 | 1.861 | 2.840 ⭐
Practice Scaling
Scale "bought" → "bought": 5.012 ÷ 2 = ?
Apple's Complete Scaled Scores
Apple's raw scores → scaled scores:
Apple → I: 3.088 ÷ 2 = 1.544
Apple → bought: 3.916 ÷ 2 = 1.958
Apple → apple: 3.571 ÷ 2 = 1.786
Apple → to: 2.952 ÷ 2 = 1.476
Apple → eat: 4.153 ÷ 2 = 2.077 ⭐ Still highest!
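The same arithmetic as the list above, as a small Python sketch (the dictionary layout is just for readability):

```python
import math

d_k = 4                 # dimensionality of the key vectors
scale = math.sqrt(d_k)  # sqrt(4) = 2

# Apple's raw attention scores against each word, from Step 6
raw = {"I": 3.088, "bought": 3.916, "apple": 3.571, "to": 2.952, "eat": 4.153}
scaled = {word: s / scale for word, s in raw.items()}

for word, s in scaled.items():
    print(f"Apple -> {word}: {s:.3f}")
# Dividing every score by the same constant preserves the ordering,
# so "eat" is still apple's highest score after scaling.
```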
Why Scale?
• Before scaling: scores ranged from 2.654 to 5.679 (large range)
• After scaling: scores range from 1.327 to 2.840 (smaller, more manageable)
• Benefit: prevents softmax from becoming too "peaky" and allows better gradient flow during training
• Pattern preserved: the relative ordering of attention scores remains the same
STEP 8: Apply Softmax
Softmax converts ALL scaled scores into probabilities that sum to 1.0 for each word. This is like giving each word an "attention budget" to distribute among all other words.
Softmax Steps for Each Row:
1. Calculate e^(each scaled score in the row)
2. Sum all the exponentials in that row
3. Divide each exponential by the row sum
4. Result: Each row sums to exactly 1.0
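The four steps above, run on apple's scaled row from Step 7, look like this as a minimal pure-Python sketch:

```python
import math

# Apple's scaled scores from Step 7, in sentence order
words  = ["I", "bought", "apple", "to", "eat"]
scores = [1.544, 1.958, 1.786, 1.476, 2.077]

exps = [math.exp(s) for s in scores]   # step 1: exponentiate each score
total = sum(exps)                      # step 2: sum the exponentials
weights = [x / total for x in exps]    # step 3: divide each by the sum

print(round(sum(weights), 6))          # step 4: the row sums to 1.0
for w, p in zip(words, weights):
    print(f"apple -> {w}: {p:.1%}")    # "eat" gets the largest share (~26.5%)
```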
All Words' Attention Patterns:
• "I" pays most attention to "eat" (26.1%): the subject focuses on the action
• "bought" pays most attention to "eat" (30.4%): the purchase action connects to consumption
• "apple" pays most attention to "eat" (26.5%): the object understands it's food
• "to" pays most attention to "eat" (30.8%): the preposition links to the purpose
• "eat" pays most attention to itself (28.2%): high self-confidence as the key action
Universal Pattern: Every word in this sentence realizes that "eat" is the most important context!
Attention Budget Verification
Check that "bought" row sums to 1.0:
0.125 + 0.286 + 0.167 + 0.118 + 0.304 = ?