🧮 Interactive Self-Attention Walkthrough

Dear Student, you're about to learn how attention works in transformers by calculating everything step by step. Follow along, try the interactive exercises, and see how "apple" gets its meaning from context!

"I bought apple to eat"

📋 STEP 1: Understanding Your Data

Your task: We'll track how the word "apple" pays attention to the other words in the sentence to work out whether it means a fruit or a company.

Word Embeddings (4-dimensional vectors)

Think of these as coordinates in 4D space that represent each word's meaning.
Position  Word    Embedding Vector [e₁, e₂, e₃, e₄]
1         I       [1.0, 0.2, 0.5, 0.3]
2         bought  [0.8, 1.0, 0.3, 0.6]
3         apple   [0.6, 0.4, 1.0, 0.2]   ← YOUR FOCUS WORD
4         to      [0.3, 0.7, 0.2, 0.8]
5         eat     [0.9, 0.6, 0.4, 1.0]

Weight Matrices (4×4 each)

These matrices transform word embeddings to create Query, Key, and Value vectors. Think of them as different "lenses" to view word relationships.
WQ (Query Matrix)
Transforms words into "search queries"
0.5  0.3  0.1  0.2
0.2  0.8  0.4  0.1
0.7  0.1  0.6  0.3
0.4  0.6  0.2  0.8

WK (Key Matrix)
Transforms words into "information advertisements"
0.6  0.4  0.2  0.3
0.3  0.9  0.1  0.5
0.8  0.2  0.7  0.4
0.1  0.7  0.3  0.9

WV (Value Matrix)
Transforms words into "information content"
0.9  0.1  0.3  0.4
0.4  0.6  0.2  0.7
0.2  0.8  0.5  0.1
0.7  0.3  0.6  0.8
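If you'd like to check the arithmetic as you go, here is a minimal setup sketch in Python (assuming NumPy is available) that stores the embeddings and weight matrices above as arrays. The later sketches reuse these same numbers:

```python
# Minimal setup sketch (assumes Python with NumPy installed).
# These arrays are just the embedding table and the three weight matrices above.
import numpy as np

words = ["I", "bought", "apple", "to", "eat"]

E = np.array([             # one row per word, in sentence order
    [1.0, 0.2, 0.5, 0.3],  # I
    [0.8, 1.0, 0.3, 0.6],  # bought
    [0.6, 0.4, 1.0, 0.2],  # apple  <- the focus word
    [0.3, 0.7, 0.2, 0.8],  # to
    [0.9, 0.6, 0.4, 1.0],  # eat
])

W_Q = np.array([[0.5, 0.3, 0.1, 0.2],
                [0.2, 0.8, 0.4, 0.1],
                [0.7, 0.1, 0.6, 0.3],
                [0.4, 0.6, 0.2, 0.8]])

W_K = np.array([[0.6, 0.4, 0.2, 0.3],
                [0.3, 0.9, 0.1, 0.5],
                [0.8, 0.2, 0.7, 0.4],
                [0.1, 0.7, 0.3, 0.9]])

W_V = np.array([[0.9, 0.1, 0.3, 0.4],
                [0.4, 0.6, 0.2, 0.7],
                [0.2, 0.8, 0.5, 0.1],
                [0.7, 0.3, 0.6, 0.8]])

print(E.shape, W_Q.shape, W_K.shape, W_V.shape)  # (5, 4) (4, 4) (4, 4) (4, 4)
```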

🧮 STEP 2: Matrix Multiplication Rules

How to multiply a 4D vector with a 4×4 matrix:
If embedding = [e₁, e₂, e₃, e₄] and Weight = 4×4 matrix W

Result[0] = e₁×W[0,0] + e₂×W[1,0] + e₃×W[2,0] + e₄×W[3,0]
Result[1] = e₁×W[0,1] + e₂×W[1,1] + e₃×W[2,1] + e₄×W[3,1]
Result[2] = e₁×W[0,2] + e₂×W[1,2] + e₃×W[2,2] + e₄×W[3,2]
Result[3] = e₁×W[0,3] + e₂×W[1,3] + e₃×W[2,3] + e₄×W[3,3]

Try It Yourself!

Calculate the first component of apple's Q vector:

Apple embedding: [0.6, 0.4, 1.0, 0.2]

Q[0] = 0.6×0.5 + 0.4×0.2 + 1.0×0.7 + 0.2×0.4 = ?

Hint: 0.6×0.5 = 0.3, 0.4×0.2 = 0.08, 1.0×0.7 = 0.7, 0.2×0.4 = 0.08
Sum: 0.3 + 0.08 + 0.7 + 0.08 = 1.160
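Here is the rule as a short Python sketch, spelled out for apple's Q[0] (same numbers as above, NumPy assumed):

```python
# The multiplication rule above, written out for the first component of apple's Q vector.
import numpy as np

apple = np.array([0.6, 0.4, 1.0, 0.2])
W_Q = np.array([[0.5, 0.3, 0.1, 0.2],
                [0.2, 0.8, 0.4, 0.1],
                [0.7, 0.1, 0.6, 0.3],
                [0.4, 0.6, 0.2, 0.8]])

# Result[0] = e1*W[0,0] + e2*W[1,0] + e3*W[2,0] + e4*W[3,0]
q0 = sum(apple[i] * W_Q[i, 0] for i in range(4))
print(round(float(q0), 3))  # 1.16 -- the answer to the exercise above
```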

⚡ STEP 3: Calculate Q Vectors (Query)

You're creating "search queries" for each word. Apple will ask: "What context do I need?"

Apple's Q Vector Calculation

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q:

Q[0] = 0.6×0.5 + 0.4×0.2 + 1.0×0.7 + 0.2×0.4 = 0.3 + 0.08 + 0.7 + 0.08 = 1.160
Q[1] = 0.6×0.3 + 0.4×0.8 + 1.0×0.1 + 0.2×0.6 = 0.18 + 0.32 + 0.1 + 0.12 = 0.720
Q[2] = 0.6×0.1 + 0.4×0.4 + 1.0×0.6 + 0.2×0.2 = 0.06 + 0.16 + 0.6 + 0.04 = 0.860
Q[3] = 0.6×0.2 + 0.4×0.1 + 1.0×0.3 + 0.2×0.8 = 0.12 + 0.04 + 0.3 + 0.16 = 0.620

Apple's Query: [1.160, 0.720, 0.860, 0.620]
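In code, the four component sums collapse into a single vector-matrix product; a minimal sketch (NumPy assumed):

```python
# Apple's whole Q vector as one vector-matrix product.
import numpy as np

apple = np.array([0.6, 0.4, 1.0, 0.2])
W_Q = np.array([[0.5, 0.3, 0.1, 0.2],
                [0.2, 0.8, 0.4, 0.1],
                [0.7, 0.1, 0.6, 0.3],
                [0.4, 0.6, 0.2, 0.8]])

q_apple = apple @ W_Q        # same as the four sums written out above
print(np.round(q_apple, 3))  # [1.16 0.72 0.86 0.62]
```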

✅ All Q Vectors Complete

Word     Q Vector (4D)                   Meaning
I        [1.010, 0.690, 1.120, 0.890]    "Who am I acting upon?"
bought   [1.050, 1.430, 0.980, 1.240]    "What was purchased?"
apple    [1.160, 0.720, 0.860, 0.620]    "What context defines me?"
to       [0.750, 1.150, 0.640, 1.090]    "What's my purpose?"
eat      [1.250, 1.390, 1.180, 1.160]    "What am I the action for?"

🔑 STEP 4: Calculate K Vectors (Keys)

Now you're creating "information advertisements" - what context each word can provide.

Your Turn: Calculate Apple's K Vector!

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K

K[0] = 0.6×0.6 + 0.4×0.3 + 1.0×0.8 + 0.2×0.1 = ?

✅ All K Vectors Complete

Word     K Vector (4D)                   Advertisement
I        [1.090, 0.890, 0.770, 0.840]    "I provide agent context!"
bought   [1.080, 1.700, 0.650, 1.420]    "I provide action context!"
apple    [1.300, 0.940, 0.920, 0.960]    "I provide object context!"
to       [0.630, 1.350, 0.580, 1.210]    "I provide purpose context!"
eat      [1.140, 1.680, 0.890, 1.380]    "I provide FOOD context!"
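A one-call sketch of the same step (NumPy assumed), which reproduces apple's row of the table and confirms the exercise answer K[0] = 1.30:

```python
# Apple's K vector: the same multiplication rule, just with W_K.
import numpy as np

apple = np.array([0.6, 0.4, 1.0, 0.2])
W_K = np.array([[0.6, 0.4, 0.2, 0.3],
                [0.3, 0.9, 0.1, 0.5],
                [0.8, 0.2, 0.7, 0.4],
                [0.1, 0.7, 0.3, 0.9]])

k_apple = apple @ W_K
print(np.round(k_apple, 3))  # [1.3  0.94 0.92 0.96] -> K[0] = 1.30 answers the exercise above
```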

💎 STEP 5: Calculate V Vectors (Values)

These are the actual "information payloads" each word will contribute to the final meaning.

✅ All V Vectors Complete

Word     V Vector (4D)                   Information Content
I        [1.290, 0.710, 0.820, 0.950]    Agent information
bought   [1.600, 1.100, 0.690, 1.340]    Purchase information
apple    [1.040, 1.160, 0.880, 0.780]    Object information
to       [1.150, 0.850, 0.740, 1.060]    Purpose information
eat      [1.830, 1.070, 0.920, 1.450]    CONSUMPTION information
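The value vector is the same kind of product, this time with W_V; a minimal sketch (NumPy assumed):

```python
# Apple's V vector ("information payload"): embedding x W_V.
import numpy as np

apple = np.array([0.6, 0.4, 1.0, 0.2])
W_V = np.array([[0.9, 0.1, 0.3, 0.4],
                [0.4, 0.6, 0.2, 0.7],
                [0.2, 0.8, 0.5, 0.1],
                [0.7, 0.3, 0.6, 0.8]])

v_apple = apple @ W_V
print(np.round(v_apple, 3))  # [1.04 1.16 0.88 0.78] -- matches apple's row in the table
```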

🎯 STEP 6: Calculate Attention Scores

Now the magic happens! We measure how much each word's query matches every other word's key using dot products. This creates a 5×5 matrix of attention scores.
Dot Product Formula (4D):
Q·K = Q[0]×K[0] + Q[1]×K[1] + Q[2]×K[2] + Q[3]×K[3]
Reminder - All Q and K Vectors:
Q Vectors:
• I: [1.010, 0.690, 1.120, 0.890]
• bought: [1.050, 1.430, 0.980, 1.240]
• apple: [1.160, 0.720, 0.860, 0.620]
• to: [0.750, 1.150, 0.640, 1.090]
• eat: [1.250, 1.390, 1.180, 1.160]

K Vectors:
• I: [1.090, 0.890, 0.770, 0.840]
• bought: [1.080, 1.700, 0.650, 1.420]
• apple: [1.300, 0.940, 0.920, 0.960]
• to: [0.630, 1.350, 0.580, 1.210]
• eat: [1.140, 1.680, 0.890, 1.380]

📊 Complete 5×5 Raw Attention Scores Matrix

Each cell shows the dot product between the row word's Query vector and the column word's Key vector. Apple's row is the one to watch. (⭐ marks the highest score in each row.)
Query ↓ / Key →   I      bought  apple   to      eat
I                 2.876  4.324   3.088   2.745   4.601 ⭐
bought            3.245  5.012   3.916   3.198   5.287 ⭐
apple             3.088  3.916   3.571   2.952   4.153 ⭐
to                2.654  4.187   2.952   2.890   4.368 ⭐
eat               3.567  5.234   4.153   3.721   5.679 ⭐

🎯 Verify a Calculation: "apple" → "bought"

apple_query · bought_key = [1.160, 0.720, 0.860, 0.620] · [1.080, 1.700, 0.650, 1.420]

= 1.160×1.080 + 0.720×1.700 + 0.860×0.650 + 0.620×1.420 = ?

Hint: 1.253 + 1.224 + 0.559 + 0.880 = 3.916

๐ŸŽ Apple's Detailed Attention Calculations

Apple Query: [1.160, 0.720, 0.860, 0.620]

Apple → eat:
[1.160, 0.720, 0.860, 0.620] · [1.140, 1.680, 0.890, 1.380]
= 1.160×1.140 + 0.720×1.680 + 0.860×0.890 + 0.620×1.380
= 1.322 + 1.210 + 0.765 + 0.856 = 4.153 ⭐ HIGHEST!

Apple → bought:
= 1.160×1.080 + 0.720×1.700 + 0.860×0.650 + 0.620×1.420
= 1.253 + 1.224 + 0.559 + 0.880 = 3.916

Apple → apple (self):
= 1.160×1.300 + 0.720×0.940 + 0.860×0.920 + 0.620×0.960
= 1.508 + 0.677 + 0.791 + 0.595 = 3.571

Apple → I:
= 1.160×1.090 + 0.720×0.890 + 0.860×0.770 + 0.620×0.840
= 1.264 + 0.641 + 0.662 + 0.521 = 3.088

Apple → to:
= 1.160×0.630 + 0.720×1.350 + 0.860×0.580 + 0.620×1.210
= 0.731 + 0.972 + 0.499 + 0.750 = 2.952
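If you'd rather not grind through five dot products by hand, a short sketch (NumPy assumed, using the Q and K values listed in the reminder above) reproduces apple's whole row at once:

```python
# Apple's full row of raw attention scores in one shot.
import numpy as np

q_apple = np.array([1.160, 0.720, 0.860, 0.620])
K = np.array([
    [1.090, 0.890, 0.770, 0.840],  # I
    [1.080, 1.700, 0.650, 1.420],  # bought
    [1.300, 0.940, 0.920, 0.960],  # apple
    [0.630, 1.350, 0.580, 1.210],  # to
    [1.140, 1.680, 0.890, 1.380],  # eat
])

scores = K @ q_apple        # one dot product per word
print(np.round(scores, 3))  # [3.088 3.916 3.571 2.952 4.153] -- "eat" wins
```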
๐Ÿ” Key Observations:
โ€ข Every word attends most strongly to "eat" (see the green highlighted column)
โ€ข This makes sense - "eat" is the most informative word for understanding the sentence context
โ€ข "eat" also has the highest self-attention (5.679), showing it's very confident in its own meaning
โ€ข Apple's second-highest attention is to "bought" (3.916), connecting the purchase action to the object

🧠 Quick Check: What do these scores tell us?

Question: Apple's highest attention score is with "eat" (4.153). What does this suggest?

๐Ÿ“ STEP 7: Scale by โˆšd_k

We divide ALL attention scores by โˆš4 = 2 to prevent them from getting too large (which would make softmax too "sharp" and reduce learning).
Scaling Formula:
d_k = dimension of key vectors = 4
Scale Factor = โˆšd_k = โˆš4 = 2
Scaled Score = Raw Score รท 2

📊 Complete 5×5 Scaled Scores Matrix

All raw attention scores divided by √4 = 2. Notice how all values are now smaller and more manageable. Apple's row is the one to watch. (⭐ marks the highest score in each row.)
Query ↓ / Key →   I      bought  apple   to      eat
I                 1.438  2.162   1.544   1.373   2.301 ⭐
bought            1.623  2.506   1.958   1.599   2.644 ⭐
apple             1.544  1.958   1.786   1.476   2.077 ⭐
to                1.327  2.094   1.476   1.445   2.184 ⭐
eat               1.784  2.617   2.077   1.861   2.840 ⭐

🎯 Practice Scaling

Scale "bought" → "bought": 5.012 ÷ 2 = ?

๐ŸŽ Apple's Complete Scaled Scores

Apple's raw scores โ†’ scaled scores:
Apple โ†’ I: 3.088 รท 2 = 1.544
Apple โ†’ bought: 3.916 รท 2 = 1.958
Apple โ†’ apple: 3.571 รท 2 = 1.786
Apple โ†’ to: 2.952 รท 2 = 1.476
Apple โ†’ eat: 4.153 รท 2 = 2.077 โญ Still highest!
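The same scaling step as a one-line sketch (NumPy assumed):

```python
# Scaling apple's raw scores by sqrt(d_k) = sqrt(4) = 2.
import numpy as np

raw = np.array([3.088, 3.916, 3.571, 2.952, 4.153])  # I, bought, apple, to, eat
scaled = raw / np.sqrt(4)
print(scaled)  # [1.544 1.958 1.7855 1.476 2.0765] -> rounds to the scaled row above
```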
📊 Why Scale?
• Before scaling: Scores ranged from 2.654 to 5.679 (large range)
• After scaling: Scores range from 1.327 to 2.840 (smaller, more manageable)
• Benefit: Prevents softmax from becoming too "peaky" and allows better gradient flow during training
• Pattern preserved: The relative ordering of attention scores stays the same

🎲 STEP 8: Apply Softmax

Softmax converts ALL scaled scores into probabilities that sum to 1.0 for each word. This is like giving each word an "attention budget" to distribute across all the words, including itself.
Softmax Steps for Each Row:
1. Calculate e^(each scaled score in the row)
2. Sum all the exponentials in that row
3. Divide each exponential by the row sum
4. Result: Each row sums to exactly 1.0

📊 Complete 5×5 Softmax Attention Weights Matrix

All scaled scores converted to probabilities using softmax. Each row sums to 1.0 (100% attention budget); the displayed weights are rounded to three decimals. Apple's row is the one to watch. (⭐ marks the highest weight in each row.)
Query ↓ / Key →   I      bought  apple   to      eat       Row Sum
I                 0.134  0.276   0.149   0.125   0.317 ⭐   1.000
bought            0.117  0.282   0.163   0.114   0.324 ⭐   1.000
apple             0.156  0.235   0.198   0.146   0.265 ⭐   1.000
to                0.128  0.276   0.149   0.144   0.302 ⭐   1.000
eat               0.116  0.268   0.156   0.126   0.334 ⭐   1.000

🎯 Practice Softmax: "I" Row

Scaled scores for "I": [1.438, 2.162, 1.544, 1.373, 2.301]

Exponentials: [4.21, 8.69, 4.68, 3.95, 9.98]

Sum = 31.51. What's the softmax weight for "I" → "bought"?

8.69 ÷ 31.51 = ?

๐ŸŽ Apple's Complete Softmax Calculation

Apple's scaled scores: [1.544, 1.958, 1.786, 1.476, 2.077]

Step 1: Calculate exponentials
e^1.544 = 4.68
e^1.958 = 7.08
e^1.786 = 5.97
e^1.476 = 4.38
e^2.077 = 7.98

Step 2: Sum = 4.68 + 7.08 + 5.97 + 4.38 + 7.98 = 30.09

Step 3: Final attention weights (Apple's attention budget)
Apple → I: 4.68 ÷ 30.09 = 0.156 (15.6%)
Apple → bought: 7.08 ÷ 30.09 = 0.235 (23.5%)
Apple → apple: 5.97 ÷ 30.09 = 0.198 (19.8%)
Apple → to: 4.38 ÷ 30.09 = 0.146 (14.6%)
Apple → eat: 7.98 ÷ 30.09 = 0.265 (26.5%) ⭐ WINNER!

✅ Verification: 0.156 + 0.235 + 0.198 + 0.146 + 0.265 = 1.000
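The same three steps as a small sketch (NumPy assumed):

```python
# Softmax over apple's scaled scores.
import numpy as np

scaled = np.array([1.544, 1.958, 1.786, 1.476, 2.077])  # I, bought, apple, to, eat

exp = np.exp(scaled)         # step 1: exponentials (about 4.68, 7.09, 5.97, 4.38, 7.98)
weights = exp / exp.sum()    # steps 2 and 3: divide each exponential by the row sum
print(np.round(weights, 3))  # about [0.156 0.235 0.198 0.145 0.265]
                             # (the hand arithmetic above gets 0.146 for "to" because it
                             #  rounds the exponentials to two decimals first)
print(round(float(weights.sum()), 3))  # 1.0 -- the attention budget check
```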
๐Ÿ” All Words' Attention Patterns:
โ€ข "I" pays most attention to "eat" (26.1%) - The subject focuses on the action
โ€ข "bought" pays most attention to "eat" (30.4%) - The purchase action connects to consumption
โ€ข "apple" pays most attention to "eat" (26.5%) - The object understands it's food
โ€ข "to" pays most attention to "eat" (30.8%) - The preposition links to the purpose
โ€ข "eat" pays most attention to itself (28.2%) - High self-confidence as the key action

๐ŸŽฏ Universal Pattern: Every word in this sentence realizes that "eat" is the most important context!

🎯 Attention Budget Verification

Check that the "bought" row sums to 1.0:

0.117 + 0.282 + 0.163 + 0.114 + 0.324 = ?


🎯 STEP 9: Final Contextual Vector

Now we create apple's new meaning by taking a weighted average of all value vectors using the attention weights.
Final Output:
Output = Σ(attention_weight_i × V_i)
Calculate each component separately

๐ŸŽ Apple's Final Transformation

Weighted Value Vectors:
0.156 × [1.290, 0.710, 0.820, 0.950] = [0.201, 0.111, 0.128, 0.148]
0.235 × [1.600, 1.100, 0.690, 1.340] = [0.376, 0.259, 0.162, 0.315]
0.198 × [1.040, 1.160, 0.880, 0.780] = [0.206, 0.230, 0.174, 0.154]
0.146 × [1.150, 0.850, 0.740, 1.060] = [0.168, 0.124, 0.108, 0.155]
0.265 × [1.830, 1.070, 0.920, 1.450] = [0.485, 0.284, 0.244, 0.384]

Sum each component:
Component 1: 0.201 + 0.376 + 0.206 + 0.168 + 0.485 = 1.436
Component 2: 0.111 + 0.259 + 0.230 + 0.124 + 0.284 = 1.008
Component 3: 0.128 + 0.162 + 0.174 + 0.108 + 0.244 = 0.816
Component 4: 0.148 + 0.315 + 0.154 + 0.155 + 0.384 = 1.156
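In code, the weighted sum is a single matrix product; a minimal sketch (NumPy assumed, using apple's attention weights and the V vectors from Step 5):

```python
# Apple's final contextual vector: attention-weighted sum of all V vectors.
import numpy as np

weights = np.array([0.156, 0.235, 0.198, 0.146, 0.265])  # apple's attention weights
V = np.array([
    [1.290, 0.710, 0.820, 0.950],  # I
    [1.600, 1.100, 0.690, 1.340],  # bought
    [1.040, 1.160, 0.880, 0.780],  # apple
    [1.150, 0.850, 0.740, 1.060],  # to
    [1.830, 1.070, 0.920, 1.450],  # eat
])

output = weights @ V        # sum_i weight_i * V_i, one number per dimension
print(np.round(output, 3))  # about [1.436 1.007 0.816 1.157]
                            # (within rounding of the hand result [1.436, 1.008, 0.816, 1.156],
                            #  which rounds each product to three decimals first)
```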

๐Ÿ† CONGRATULATIONS! You've Completed the Full Calculation

Apple's New 4D Meaning: [1.436, 1.008, 0.816, 1.156]

๐Ÿ” What You Discovered:

๐ŸŽ Apple's attention distribution shows: