Interactive Multi-Head Attention Walkthrough
Dear Student, you're about to explore the power of multi-head attention in transformers! We'll break down how the word "apple" builds a rich, context-aware representation through three specialized attention heads, each focusing on a different kind of relationship. Follow the steps, try the exercises, and see the magic of parallel processing!
"I bought apple to eat"
STEP 1: Understanding the Input Data
Your task: we'll compute how "apple" uses three attention heads to understand its context as a fruit, focusing on semantic, syntactic, and purposive relationships.
Word Embeddings (4-dimensional vectors)
These vectors represent each word's initial meaning in a 4D space, shared across all attention heads.
Position | Word | Embedding Vector [e₁, e₂, e₃, e₄]
1 | I | [1.0, 0.2, 0.5, 0.3]
2 | bought | [0.7, 0.9, 0.4, 0.8]
3 | apple | [0.6, 0.4, 1.0, 0.2] ← YOUR FOCUS WORD
4 | to | [0.2, 0.5, 0.3, 0.6]
5 | eat | [0.8, 1.0, 0.3, 0.6]
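If you'd like to follow along in code, here is a minimal NumPy sketch of this table (the variable names are ours, not part of the walkthrough):

```python
import numpy as np

# One 4-D embedding per word, in sentence order (values from the table above)
WORDS = ["I", "bought", "apple", "to", "eat"]
EMBEDDINGS = np.array([
    [1.0, 0.2, 0.5, 0.3],  # I
    [0.7, 0.9, 0.4, 0.8],  # bought
    [0.6, 0.4, 1.0, 0.2],  # apple  <- your focus word
    [0.2, 0.5, 0.3, 0.6],  # to
    [0.8, 1.0, 0.3, 0.6],  # eat
])
```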
STEP 2: Matrix Multiplication Rules
How to multiply a 4-D vector by a 4×4 matrix:
If embedding = [e₁, e₂, e₃, e₄] and Weight = 4×4 matrix:
Result[0] = e₁×W[0,0] + e₂×W[1,0] + e₃×W[2,0] + e₄×W[3,0]
Result[1] = e₁×W[0,1] + e₂×W[1,1] + e₃×W[2,1] + e₄×W[3,1]
Result[2] = e₁×W[0,2] + e₂×W[1,2] + e₃×W[2,2] + e₄×W[3,2]
Result[3] = e₁×W[0,3] + e₂×W[1,3] + e₃×W[2,3] + e₄×W[3,3]
Try It Yourself!
Calculate the first component of apple's Q₁ vector for Head 1:
Apple embedding: [0.6, 0.4, 1.0, 0.2]
Column 0 of W_Q1: [0.8, 0.1, 0.6, 0.3]
Q₁[0] = 0.6×0.8 + 0.4×0.1 + 1.0×0.6 + 0.2×0.3 = ?
Hint: compute the four products first: 0.6×0.8 = 0.48, 0.4×0.1 = 0.04, 1.0×0.6 = 0.6, 0.2×0.3 = 0.06, then add them up.
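The same rule in NumPy is a single matrix product; a quick sketch, with W_Q1 copied from Step 3 below (the `@` operator performs exactly the row-times-column sums written out above):

```python
import numpy as np

apple = np.array([0.6, 0.4, 1.0, 0.2])  # apple's embedding
W_Q1 = np.array([
    [0.8, 0.2, 0.1, 0.3],
    [0.1, 0.9, 0.4, 0.2],
    [0.6, 0.4, 0.7, 0.5],
    [0.3, 0.7, 0.2, 0.8],
])

# Each output component is the embedding dotted with one column of W, as in the rule above
q1_apple = apple @ W_Q1
```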
STEP 3: Head 1 - Semantic Relationships (Query Vectors)
Head 1 focuses on semantic relationships, transforming embeddings into queries to find meaning-based connections.
Weight Matrices for Head 1 (4×4)
WQ1 (Query Matrix): transforms words into semantic "search queries"
[0.8 0.2 0.1 0.3]
[0.1 0.9 0.4 0.2]
[0.6 0.4 0.7 0.5]
[0.3 0.7 0.2 0.8]
WK1 (Key Matrix): transforms words into semantic "information advertisements"
[0.7 0.3 0.2 0.4]
[0.4 0.6 0.5 0.3]
[0.9 0.1 0.8 0.2]
[0.2 0.8 0.3 0.7]
WV1 (Value Matrix): transforms words into semantic "information content"
[0.5 0.5 0.3 0.2]
[0.8 0.2 0.6 0.4]
[0.3 0.7 0.5 0.1]
[0.6 0.4 0.2 0.8]
Apple's Q₁ Vector Calculation
Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q1:
Q₁[0] = 0.6×0.8 + 0.4×0.1 + 1.0×0.6 + 0.2×0.3 = 0.48 + 0.04 + 0.6 + 0.06 = 0.81
Q₁[1] = 0.6×0.2 + 0.4×0.9 + 1.0×0.4 + 0.2×0.7 = 0.12 + 0.36 + 0.4 + 0.14 = 0.67
Q₁[2] = 0.6×0.1 + 0.4×0.4 + 1.0×0.7 + 0.2×0.2 = 0.06 + 0.16 + 0.7 + 0.04 = 0.89
Q₁[3] = 0.6×0.3 + 0.4×0.2 + 1.0×0.5 + 0.2×0.8 = 0.18 + 0.08 + 0.5 + 0.16 = 0.73
Apple's Query (Head 1): [0.81, 0.67, 0.89, 0.73]
All Q₁ Vectors
Word | Q₁ Vector (4D) | Meaning
I | [0.95, 0.52, 0.75, 0.62] | "Who am I acting upon?"
bought | [0.82, 0.95, 0.69, 0.98] | "What was purchased?"
apple | [0.81, 0.67, 0.89, 0.73] | "What semantic context defines me?"
to | [0.45, 0.58, 0.47, 0.61] | "What's my purpose?"
eat | [0.87, 0.92, 0.66, 0.94] | "What am I the action for?"
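Rather than projecting one word at a time, all five queries come from a single matrix product; a minimal sketch of that batching (embedding and weight values as given above):

```python
import numpy as np

EMBEDDINGS = np.array([
    [1.0, 0.2, 0.5, 0.3],  # I
    [0.7, 0.9, 0.4, 0.8],  # bought
    [0.6, 0.4, 1.0, 0.2],  # apple
    [0.2, 0.5, 0.3, 0.6],  # to
    [0.8, 1.0, 0.3, 0.6],  # eat
])
W_Q1 = np.array([
    [0.8, 0.2, 0.1, 0.3],
    [0.1, 0.9, 0.4, 0.2],
    [0.6, 0.4, 0.7, 0.5],
    [0.3, 0.7, 0.2, 0.8],
])

Q1 = EMBEDDINGS @ W_Q1  # shape (5, 4): one query row per word
```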
STEP 4: Head 1 - Semantic Relationships (Key Vectors)
Key vectors in Head 1 advertise what semantic information each word can provide.
Your Turn: Calculate Apple's K₁ Vector!
Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K1
K₁[0] = 0.6×0.7 + 0.4×0.4 + 1.0×0.9 + 0.2×0.2 = ?
All K₁ Vectors
Word | K₁ Vector (4D) | Advertisement
I | [1.22, 0.76, 0.81, 0.68] | "I provide agent context!"
bought | [1.15, 0.99, 0.72, 0.93] | "I provide action context!"
apple | [1.61, 0.83, 0.94, 0.78] | "I provide object context!"
to | [0.62, 0.67, 0.49, 0.65] | "I provide purpose context!"
eat | [1.13, 0.96, 0.69, 0.89] | "I provide FOOD context!"
STEP 5: Head 1 - Semantic Relationships (Value Vectors)
Value vectors in Head 1 carry the semantic information each word contributes.
All V₁ Vectors
Word | V₁ Vector (4D) | Information Content
I | [0.72, 0.64, 0.55, 0.47] | Agent information
bought | [0.61, 0.78, 0.49, 0.66] | Purchase information
apple | [0.58, 0.62, 0.52, 0.43] | Object information
to | [0.44, 0.51, 0.38, 0.49] | Purpose information
eat | [0.59, 0.75, 0.46, 0.62] | CONSUMPTION information
STEP 6: Head 1 - Calculate Attention Scores
We compute dot products between each word's query vector and all key vectors to find semantic relationships. The table below shows the attention scores for all query-key pairs in Head 1, with the highest score for each query marked ⭐.
Dot Product Formula (4D):
Q₁·K₁ = Q₁[0]×K₁[0] + Q₁[1]×K₁[1] + Q₁[2]×K₁[2] + Q₁[3]×K₁[3]
Attention Scores Matrix (Head 1)
Query \ Key | I | bought | apple | to | eat
I [0.95, 0.52, 0.75, 0.62] | 2.42 ⭐ | 2.40 | 2.37 | 1.45 | 2.36
bought [0.82, 0.95, 0.69, 0.98] | 2.81 | 3.01 ⭐ | 2.77 | 1.76 | 2.85
apple [0.81, 0.67, 0.89, 0.73] | 2.71 | 2.92 | 3.27 ⭐ | 1.86 | 2.82
to [0.45, 0.58, 0.47, 0.61] | 1.57 | 1.74 | 1.59 | 1.76 ⭐ | 1.71
eat [0.87, 0.92, 0.66, 0.94] | 2.76 | 2.96 ⭐ | 2.72 | 1.74 | 2.87
(Row labels show each query vector; the key vectors are listed in Step 4.)
Apple's Attention Scores (Head 1)
Apple Query: [0.81, 0.67, 0.89, 0.73]
Apple → I: [0.81, 0.67, 0.89, 0.73] · [1.22, 0.76, 0.81, 0.68] = 0.81×1.22 + 0.67×0.76 + 0.89×0.81 + 0.73×0.68 = 0.988 + 0.509 + 0.721 + 0.496 = 2.714
Apple → bought: = 0.81×1.15 + 0.67×0.99 + 0.89×0.72 + 0.73×0.93 = 0.932 + 0.663 + 0.641 + 0.679 = 2.915
Apple → apple: = 0.81×1.61 + 0.67×0.83 + 0.89×0.94 + 0.73×0.78 = 1.304 + 0.556 + 0.837 + 0.569 = 3.266 ⭐ HIGHEST!
Apple → to: = 0.81×0.62 + 0.67×0.67 + 0.89×0.49 + 0.73×0.65 = 0.502 + 0.449 + 0.436 + 0.475 = 1.862
Apple → eat: = 0.81×1.13 + 0.67×0.96 + 0.89×0.69 + 0.73×0.89 = 0.915 + 0.643 + 0.614 + 0.650 = 2.822
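Here is the same row of dot products in NumPy, using apple's query and the key vectors tabulated in Step 4; it reproduces the five scores above up to rounding:

```python
import numpy as np

q_apple = np.array([0.81, 0.67, 0.89, 0.73])  # apple's Head-1 query
K1 = np.array([
    [1.22, 0.76, 0.81, 0.68],  # I
    [1.15, 0.99, 0.72, 0.93],  # bought
    [1.61, 0.83, 0.94, 0.78],  # apple
    [0.62, 0.67, 0.49, 0.65],  # to
    [1.13, 0.96, 0.69, 0.89],  # eat
])

scores = K1 @ q_apple  # one dot product per key
print(scores)          # ~[2.71, 2.91, 3.27, 1.86, 2.82]
```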
STEP 7: Head 1 - Scale and Softmax
We scale attention scores by √4 = 2 and apply softmax to get attention weights that sum to 1.0 for each query word. The table below shows the attention weights for all query-key pairs in Head 1, with the highest weight for each query marked ⭐.
Scaling and Softmax:
Scaled Score = Raw Score ÷ √4 = Raw Score ÷ 2
Softmax: e^(Scaled Score) ÷ Sum(e^(Scaled Scores))
Attention Weights Matrix (Head 1)
Query \ Key | I | bought | apple | to | eat
I | 0.208 ⭐ | 0.204 | 0.199 | 0.080 | 0.197
bought | 0.202 | 0.245 ⭐ | 0.194 | 0.070 | 0.209
apple | 0.151 | 0.184 | 0.162 | 0.121 | 0.382 ⭐
to | 0.170 | 0.202 | 0.175 | 0.206 ⭐ | 0.196
eat | 0.201 | 0.243 ⭐ | 0.191 | 0.072 | 0.223
Practice Softmax: Head 1, Apple → eat
Raw score: 2.822, Scaled: 2.822 ÷ 2 = 1.411
Exponentials: [e^1.357, e^1.458, e^1.633, e^0.931, e^1.411] = [3.885, 4.301, 5.121, 2.537, 4.101]
Sum = 19.945, Softmax = 4.101 ÷ 19.945 = ?
Apple's Attention Weights (Head 1)
Word | Weight
I | 0.151
bought | 0.184
apple | 0.162
to | 0.121
eat | 0.382 ⭐
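Scaling and softmax together are only a few lines of NumPy; a generic sketch of the operation, demonstrated on apple's Head-1 scores (the max-subtraction is a standard numerical-stability trick, not something the walkthrough does):

```python
import numpy as np

def scaled_softmax(scores, d_k=4):
    """Divide raw scores by sqrt(d_k), then exponentiate and normalize."""
    scaled = np.asarray(scores, dtype=float) / np.sqrt(d_k)
    exps = np.exp(scaled - scaled.max())  # stability shift; cancels in the ratio
    return exps / exps.sum()

weights = scaled_softmax([2.714, 2.915, 3.266, 1.862, 2.822])
print(weights.sum())  # ~1.0: attention weights always sum to one
```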
STEP 8: Head 1 - Output Vector
We compute the weighted sum of value vectors using attention weights to get Head 1's output for "apple".
Apple's Output (Head 1)
Weighted Value Vectors:
0.151 × [0.72, 0.64, 0.55, 0.47] = [0.109, 0.097, 0.083, 0.071]
0.184 × [0.61, 0.78, 0.49, 0.66] = [0.112, 0.143, 0.090, 0.121]
0.162 × [0.58, 0.62, 0.52, 0.43] = [0.094, 0.100, 0.084, 0.070]
0.121 × [0.44, 0.51, 0.38, 0.49] = [0.053, 0.062, 0.046, 0.059]
0.382 × [0.59, 0.75, 0.46, 0.62] = [0.225, 0.287, 0.176, 0.237]
Sum:
Component 1: 0.109 + 0.112 + 0.094 + 0.053 + 0.225 = 0.59
Component 2: 0.097 + 0.143 + 0.100 + 0.062 + 0.287 = 0.69
Component 3: 0.083 + 0.090 + 0.084 + 0.046 + 0.176 = 0.48
Component 4: 0.071 + 0.121 + 0.070 + 0.059 + 0.237 = 0.56
Output: [0.59, 0.69, 0.48, 0.56]
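The weighted sum is again one matrix product; this sketch uses apple's attention weights and the value vectors from Steps 5 and 7:

```python
import numpy as np

weights = np.array([0.151, 0.184, 0.162, 0.121, 0.382])  # apple's Head-1 weights
V1 = np.array([
    [0.72, 0.64, 0.55, 0.47],  # I
    [0.61, 0.78, 0.49, 0.66],  # bought
    [0.58, 0.62, 0.52, 0.43],  # apple
    [0.44, 0.51, 0.38, 0.49],  # to
    [0.59, 0.75, 0.46, 0.62],  # eat
])

head1_output = weights @ V1   # blend of all five value vectors
print(head1_output.round(2))  # [0.59 0.69 0.48 0.56]
```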
STEP 9: Heads 2 & 3 - Positional/Syntactic and Contextual Purpose
Heads 2 and 3 follow the same process as Head 1 but focus on syntactic structure and contextual purpose, respectively. Let's compute the full process for each head; the sketch below condenses the per-head pipeline into a single function.
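Because every head repeats the same four operations, the whole per-head pipeline fits in one small function; a minimal sketch (pass in each head's weight matrices from the steps below):

```python
import numpy as np

def attention_head(E, W_Q, W_K, W_V):
    """One attention head: project, score, scale, softmax, weighted sum."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V            # three (5, 4) projections
    scores = Q @ K.T                               # (5, 5) query-key dot products
    scaled = scores / np.sqrt(E.shape[1])          # divide by sqrt(d_k) = 2 here
    weights = np.exp(scaled)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                             # (5, 4) output, one row per word
```

Calling this three times with three different weight triples is all that "multi-head" means.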
STEP 9.1: Head 2 - Positional/Syntactic (Query Vectors)
Head 2 focuses on positional and syntactic relationships, transforming embeddings into queries to find structural connections in the sentence.
Weight Matrices for Head 2 (4×4)
WQ2 (Query Matrix): transforms words into syntactic "search queries"
[0.6 0.3 0.2 0.4]
[0.2 0.7 0.5 0.1]
[0.8 0.4 0.6 0.3]
[0.5 0.1 0.3 0.9]
WK2 (Key Matrix): transforms words into syntactic "information advertisements"
[0.5 0.2 0.3 0.6]
[0.3 0.8 0.4 0.2]
[0.7 0.5 0.9 0.1]
[0.4 0.6 0.2 0.8]
WV2 (Value Matrix): transforms words into syntactic "information content"
[0.4 0.6 0.5 0.3]
[0.7 0.3 0.2 0.5]
[0.2 0.8 0.4 0.6]
[0.5 0.4 0.7 0.2]
Apple's Q₂ Vector Calculation
Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q2:
Q₂[0] = 0.6×0.6 + 0.4×0.2 + 1.0×0.8 + 0.2×0.5 = 0.36 + 0.08 + 0.8 + 0.1 = 1.34
Q₂[1] = 0.6×0.3 + 0.4×0.7 + 1.0×0.4 + 0.2×0.1 = 0.18 + 0.28 + 0.4 + 0.02 = 0.88
Q₂[2] = 0.6×0.2 + 0.4×0.5 + 1.0×0.6 + 0.2×0.3 = 0.12 + 0.2 + 0.6 + 0.06 = 0.98
Q₂[3] = 0.6×0.4 + 0.4×0.1 + 1.0×0.3 + 0.2×0.9 = 0.24 + 0.04 + 0.3 + 0.18 = 0.76
Apple's Query (Head 2): [1.34, 0.88, 0.98, 0.76]
All Q₂ Vectors
Word | Q₂ Vector (4D) | Meaning
I | [0.93, 0.51, 0.67, 0.58] | "What is my syntactic role?"
bought | [1.01, 0.96, 0.83, 0.84] | "What action do I govern?"
apple | [1.34, 0.88, 0.98, 0.76] | "What is my syntactic position?"
to | [0.53, 0.62, 0.51, 0.65] | "What follows me syntactically?"
eat | [0.95, 0.89, 0.71, 0.79] | "What action am I linked to?"
STEP 9.2: Head 2 - Positional/Syntactic (Key Vectors)
Key vectors in Head 2 advertise what syntactic information each word can provide.
Your Turn: Calculate Apple's K₂ Vector!
Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K2
K₂[0] = 0.6×0.5 + 0.4×0.3 + 1.0×0.7 + 0.2×0.4 = ?
All K₂ Vectors
Word | K₂ Vector (4D) | Advertisement
I | [0.88, 0.62, 0.79, 0.55] | "I provide subject context!"
bought | [1.25, 0.94, 0.88, 0.72] | "I provide verb context!"
apple | [1.20, 0.79, 0.95, 0.68] | "I provide object context!"
to | [0.59, 0.65, 0.52, 0.62] | "I provide preposition context!"
eat | [0.92, 0.81, 0.74, 0.66] | "I provide verb context!"
STEP 9.3: Head 2 - Positional/Syntactic (Value Vectors)
Value vectors in Head 2 carry the syntactic information each word contributes.
All V₂ Vectors
Word | V₂ Vector (4D) | Information Content
I | [0.65, 0.51, 0.42, 0.38] | Subject information
bought | [0.72, 0.49, 0.55, 0.41] | Verb information
apple | [0.68, 0.47, 0.53, 0.39] | Object information
to | [0.51, 0.42, 0.36, 0.45] | Preposition information
eat | [0.61, 0.46, 0.49, 0.37] | Verb information
STEP 9.4: Head 2 - Calculate Attention Scores
We compute dot products between each word's query vector and all key vectors to find syntactic relationships. The table below shows the attention scores for all query-key pairs in Head 2, with the highest score for each query marked ⭐.
Attention Scores Matrix (Head 2)
Query \ Key | I | bought | apple | to | eat
I [0.93, 0.51, 0.67, 0.58] | 1.68 ⭐ | 1.64 | 1.61 | 1.23 | 1.59
bought [1.01, 0.96, 0.83, 0.84] | 1.92 | 2.05 ⭐ | 1.97 | 1.49 | 1.90
apple [1.34, 0.88, 0.98, 0.76] | 2.92 | 3.91 ⭐ | 3.75 | 2.34 | 3.17
to [0.53, 0.62, 0.51, 0.65] | 1.28 | 1.39 | 1.33 | 1.41 ⭐ | 1.36
eat [0.95, 0.89, 0.71, 0.79] | 1.83 | 1.97 ⭐ | 1.89 | 1.46 | 1.92
(Row labels show each query vector; the key vectors are listed in Step 9.2.)
Apple's Attention Scores (Head 2)
Apple Query: [1.34, 0.88, 0.98, 0.76]
Apple → I: [1.34, 0.88, 0.98, 0.76] · [0.88, 0.62, 0.79, 0.55] = 1.34×0.88 + 0.88×0.62 + 0.98×0.79 + 0.76×0.55 = 1.179 + 0.546 + 0.774 + 0.418 = 2.917
Apple → bought: = 1.34×1.25 + 0.88×0.94 + 0.98×0.88 + 0.76×0.72 = 1.675 + 0.827 + 0.862 + 0.547 = 3.911 ⭐ HIGHEST!
Apple → apple: = 1.34×1.20 + 0.88×0.79 + 0.98×0.95 + 0.76×0.68 = 1.608 + 0.695 + 0.931 + 0.517 = 3.751
Apple → to: = 1.34×0.59 + 0.88×0.65 + 0.98×0.52 + 0.76×0.62 = 0.791 + 0.572 + 0.510 + 0.471 = 2.344
Apple → eat: = 1.34×0.92 + 0.88×0.81 + 0.98×0.74 + 0.76×0.66 = 1.233 + 0.712 + 0.725 + 0.502 = 3.172
STEP 9.5: Head 2 - Scale and Softmax
We scale attention scores by √4 = 2 and apply softmax to get attention weights that sum to 1.0 for each query word.
Attention Weights Matrix (Head 2)
Query \ Key | I | bought | apple | to | eat
I | 0.269 ⭐ | 0.258 | 0.250 | 0.172 | 0.246
bought | 0.223 | 0.315 ⭐ | 0.234 | 0.146 | 0.217
apple | 0.165 | 0.272 ⭐ | 0.247 | 0.116 | 0.200
to | 0.194 | 0.216 | 0.204 | 0.221 ⭐ | 0.210
eat | 0.221 | 0.256 ⭐ | 0.236 | 0.153 | 0.245
Practice Softmax: Head 2, Apple → bought
Raw score: 3.911, Scaled: 3.911 ÷ 2 = 1.956
Exponentials: [e^1.459, e^1.956, e^1.876, e^1.172, e^1.586] = [4.304, 7.071, 6.529, 3.229, 4.885]
Sum = 26.018, Softmax = 7.071 ÷ 26.018 = ?
Apple's Attention Weights (Head 2)
Word | Weight
I | 0.165
bought | 0.272 ⭐
apple | 0.247
to | 0.116
eat | 0.200
STEP 9.6: Head 2 - Output Vector
We compute the weighted sum of value vectors using attention weights to get Head 2's output for "apple".
Apple's Output (Head 2)
Weighted Value Vectors:
0.165 × [0.65, 0.51, 0.42, 0.38] = [0.107, 0.084, 0.069, 0.063]
0.272 × [0.72, 0.49, 0.55, 0.41] = [0.196, 0.133, 0.150, 0.112]
0.247 × [0.68, 0.47, 0.53, 0.39] = [0.168, 0.116, 0.131, 0.096]
0.116 × [0.51, 0.42, 0.36, 0.45] = [0.059, 0.049, 0.042, 0.052]
0.200 × [0.61, 0.46, 0.49, 0.37] = [0.122, 0.092, 0.098, 0.074]
Sum:
Component 1: 0.107 + 0.196 + 0.168 + 0.059 + 0.122 = 0.65
Component 2: 0.084 + 0.133 + 0.116 + 0.049 + 0.092 = 0.47
Component 3: 0.069 + 0.150 + 0.131 + 0.042 + 0.098 = 0.49
Component 4: 0.063 + 0.112 + 0.096 + 0.052 + 0.074 = 0.40
Output: [0.65, 0.47, 0.49, 0.40]
STEP 9.7: Head 3 - Contextual Purpose (Query Vectors)
Head 3 focuses on contextual purpose, transforming embeddings into queries to find purpose-driven connections in the sentence.
Weight Matrices for Head 3 (4×4)
WQ3 (Query Matrix): transforms words into purpose-driven "search queries"
[0.7 0.4 0.3 0.5]
[0.3 0.8 0.6 0.2]
[0.9 0.5 0.7 0.4]
[0.6 0.2 0.4 0.8]
WK3 (Key Matrix): transforms words into purpose-driven "information advertisements"
[0.6 0.3 0.4 0.5]
[0.4 0.7 0.5 0.3]
[0.8 0.6 0.8 0.2]
[0.5 0.7 0.3 0.9]
WV3 (Value Matrix): transforms words into purpose-driven "information content"
[0.5 0.7 0.4 0.3]
[0.8 0.4 0.5 0.6]
[0.3 0.9 0.5 0.7]
[0.6 0.5 0.6 0.4]
Apple's Q₃ Vector Calculation
Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q3:
Q₃[0] = 0.6×0.7 + 0.4×0.3 + 1.0×0.9 + 0.2×0.6 = 0.42 + 0.12 + 0.9 + 0.12 = 1.57
Q₃[1] = 0.6×0.4 + 0.4×0.8 + 1.0×0.5 + 0.2×0.2 = 0.24 + 0.32 + 0.5 + 0.04 = 1.10
Q₃[2] = 0.6×0.3 + 0.4×0.6 + 1.0×0.7 + 0.2×0.4 = 0.18 + 0.24 + 0.7 + 0.08 = 1.20
Q₃[3] = 0.6×0.5 + 0.4×0.2 + 1.0×0.4 + 0.2×0.8 = 0.3 + 0.08 + 0.4 + 0.16 = 0.94
Apple's Query (Head 3): [1.57, 1.10, 1.20, 0.94]
All Q₃ Vectors
Word | Q₃ Vector (4D) | Meaning
I | [1.05, 0.63, 0.79, 0.67] | "What is my purpose?"
bought | [1.13, 0.99, 0.87, 0.89] | "What is the purpose of my action?"
apple | [1.57, 1.10, 1.20, 0.94] | "What is my contextual purpose?"
to | [0.61, 0.67, 0.56, 0.69] | "What purpose do I serve?"
eat | [1.01, 0.94, 0.76, 0.84] | "What is my action's purpose?"
STEP 9.8: Head 3 - Contextual Purpose (Key Vectors)
Key vectors in Head 3 advertise what purpose-driven information each word can provide.
Your Turn: Calculate Apple's K₃ Vector!
Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K3
K₃[0] = 0.6×0.6 + 0.4×0.4 + 1.0×0.8 + 0.2×0.5 = ?
All K₃ Vectors
Word | K₃ Vector (4D) | Advertisement
I | [0.99, 0.71, 0.84, 0.62] | "I provide agent purpose!"
bought | [1.30, 0.98, 0.92, 0.77] | "I provide action purpose!"
apple | [1.42, 0.84, 0.99, 0.73] | "I provide object purpose!"
to | [0.64, 0.69, 0.57, 0.67] | "I provide preposition purpose!"
eat | [0.97, 0.86, 0.79, 0.71] | "I provide action purpose!"
STEP 9.9: Head 3 - Contextual Purpose (Value Vectors)
Value vectors in Head 3 carry the purpose-driven information each word contributes.
All V₃ Vectors
Word | V₃ Vector (4D) | Information Content
I | [0.67, 0.53, 0.44, 0.39] | Agent purpose
bought | [0.74, 0.51, 0.57, 0.43] | Action purpose
apple | [0.70, 0.49, 0.55, 0.41] | Object purpose
to | [0.53, 0.44, 0.38, 0.47] | Preposition purpose
eat | [0.63, 0.48, 0.51, 0.39] | Action purpose
STEP 9.10: Head 3 - Calculate Attention Scores
We compute dot products between each word's query vector and all key vectors to find purpose-driven relationships.
Attention Scores Matrix (Head 3)
Query \ Key | I | bought | apple | to | eat
I [1.05, 0.63, 0.79, 0.67] | 1.95 ⭐ | 1.91 | 1.88 | 1.31 | 1.85
bought [1.13, 0.99, 0.87, 0.89] | 2.15 | 2.28 ⭐ | 2.20 | 1.47 | 2.13
apple [1.57, 1.10, 1.20, 0.94] | 3.93 | 4.95 | 5.03 ⭐ | 3.08 | 4.08
to [0.61, 0.67, 0.56, 0.69] | 1.49 | 1.60 | 1.54 | 1.62 ⭐ | 1.57
eat [1.01, 0.94, 0.76, 0.84] | 2.07 | 2.20 ⭐ | 2.12 | 1.44 | 2.09
(Row labels show each query vector; the key vectors are listed in Step 9.8.)
Apple's Attention Scores (Head 3)
Apple Query: [1.57, 1.10, 1.20, 0.94]
Apple → I: [1.57, 1.10, 1.20, 0.94] · [0.99, 0.71, 0.84, 0.62] = 1.57×0.99 + 1.10×0.71 + 1.20×0.84 + 0.94×0.62 = 1.554 + 0.781 + 1.008 + 0.583 = 3.926
Apple → bought: = 1.57×1.30 + 1.10×0.98 + 1.20×0.92 + 0.94×0.77 = 2.041 + 1.078 + 1.104 + 0.724 = 4.947
Apple → apple: = 1.57×1.42 + 1.10×0.84 + 1.20×0.99 + 0.94×0.73 = 2.229 + 0.924 + 1.188 + 0.686 = 5.027 ⭐ HIGHEST!
Apple → to: = 1.57×0.64 + 1.10×0.69 + 1.20×0.57 + 0.94×0.67 = 1.005 + 0.759 + 0.684 + 0.630 = 3.078
Apple → eat: = 1.57×0.97 + 1.10×0.86 + 1.20×0.79 + 0.94×0.71 = 1.523 + 0.946 + 0.948 + 0.667 = 4.084
STEP 9.11: Head 3 - Scale and Softmax
We scale attention scores by √4 = 2 and apply softmax to get attention weights that sum to 1.0 for each query word.
Attention Weights Matrix (Head 3)
Query \ Key | I | bought | apple | to | eat
I | 0.241 ⭐ | 0.232 | 0.225 | 0.127 | 0.218
bought | 0.228 | 0.260 ⭐ | 0.239 | 0.115 | 0.223
apple | 0.163 | 0.204 | 0.177 | 0.121 | 0.335 ⭐
to | 0.187 | 0.209 | 0.197 | 0.213 ⭐ | 0.203
eat | 0.225 | 0.257 ⭐ | 0.237 | 0.119 | 0.229
Practice Softmax: Head 3, Apple → eat
Raw score: 4.084, Scaled: 4.084 ÷ 2 = 2.042
Exponentials: [e^1.963, e^2.474, e^2.514, e^1.539, e^2.042] = [7.125, 11.870, 12.353, 4.657, 7.706]
Sum = 43.711, Softmax = 7.706 ÷ 43.711 = ?
Apple's Attention Weights (Head 3)
Word | Weight
I | 0.163
bought | 0.204
apple | 0.177
to | 0.121
eat | 0.335 ⭐
STEP 9.12: Head 3 - Output Vector
We compute the weighted sum of value vectors using attention weights to get Head 3's output for "apple".
Apple's Output (Head 3)
Weighted Value Vectors:
0.163 × [0.67, 0.53, 0.44, 0.39] = [0.109, 0.086, 0.072, 0.064]
0.204 × [0.74, 0.51, 0.57, 0.43] = [0.151, 0.104, 0.116, 0.088]
0.177 × [0.70, 0.49, 0.55, 0.41] = [0.124, 0.087, 0.097, 0.073]
0.121 × [0.53, 0.44, 0.38, 0.47] = [0.064, 0.053, 0.046, 0.057]
0.335 × [0.63, 0.48, 0.51, 0.39] = [0.211, 0.161, 0.171, 0.131]
Sum:
Component 1: 0.109 + 0.151 + 0.124 + 0.064 + 0.211 = 0.66
Component 2: 0.086 + 0.104 + 0.087 + 0.053 + 0.161 = 0.49
Component 3: 0.072 + 0.116 + 0.097 + 0.046 + 0.171 = 0.50
Component 4: 0.064 + 0.088 + 0.073 + 0.057 + 0.131 = 0.41
Output: [0.66, 0.49, 0.50, 0.41]
STEP 10: Concatenate Head Outputs
We combine the outputs from all three heads for "apple" to form a 12D vector, capturing semantic, syntactic, and purposive insights.
Apple's Multi-Head Output
Head 1 (Semantic): [0.59, 0.69, 0.48, 0.56]
Head 2 (Syntactic): [0.65, 0.47, 0.49, 0.40]
Head 3 (Purpose): [0.66, 0.49, 0.50, 0.41]
Concatenated: [0.59, 0.69, 0.48, 0.56, 0.65, 0.47, 0.49, 0.40, 0.66, 0.49, 0.50, 0.41]
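In code, concatenation is a one-liner; a sketch using the three head outputs above:

```python
import numpy as np

head1 = np.array([0.59, 0.69, 0.48, 0.56])  # semantic
head2 = np.array([0.65, 0.47, 0.49, 0.40])  # syntactic
head3 = np.array([0.66, 0.49, 0.50, 0.41])  # purpose

apple_multihead = np.concatenate([head1, head2, head3])  # 12-D context vector
print(apple_multihead.shape)  # (12,)
```

In a full transformer, this concatenated vector would typically pass through one more learned projection (often called W_O) before the next layer; the walkthrough stops at concatenation.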
STEP 11: Understanding the Result
The 12-D vector represents "apple" with enriched context:
Semantic (Head 1): strong attention to "eat" (0.382) emphasizes apple as a fruit to be consumed.
Syntactic (Head 2): high attention to "bought" (0.272) highlights apple's role as the object of purchase.
Purpose (Head 3): strong attention to "eat" (0.335) reinforces apple's purpose as food.
This multi-faceted representation enables the transformer to understand "apple" in context!