🧠 Interactive Multi-Head Attention Walkthrough

Dear student, you're about to explore the power of multi-head attention in transformers! We'll break down how the word "apple" gains a rich understanding through three specialized attention heads, each focusing on a different kind of relationship. Follow the steps, try the exercises, and see the magic of parallel processing!

Example sentence: "I bought apple to eat"

📋 STEP 1: Understanding the Input Data

Your task: We'll compute how "apple" uses three attention heads to understand its context as a fruit, focusing on semantic, syntactic, and purposive relationships.

Word Embeddings (4-dimensional vectors)

These vectors represent each word's initial meaning in a 4D space, shared across all attention heads.

Position  Word    Embedding Vector [e₁, e₂, e₃, e₄]
1         I       [1.0, 0.2, 0.5, 0.3]
2         bought  [0.7, 0.9, 0.4, 0.8]
3         apple   [0.6, 0.4, 1.0, 0.2]  ← YOUR FOCUS WORD
4         to      [0.2, 0.5, 0.3, 0.6]
5         eat     [0.8, 1.0, 0.3, 0.6]

🧮 STEP 2: Matrix Multiplication Rules

How to multiply a 4D vector with a 4×4 matrix:
If embedding = [e₁, e₂, e₃, e₄] and Weight = 4×4 matrix

Result[0] = e₁×W[0,0] + e₂×W[1,0] + e₃×W[2,0] + e₄×W[3,0]
Result[1] = e₁×W[0,1] + e₂×W[1,1] + e₃×W[2,1] + e₄×W[3,1]
Result[2] = e₁×W[0,2] + e₂×W[1,2] + e₃×W[2,2] + e₄×W[3,2]
Result[3] = e₁×W[0,3] + e₂×W[1,3] + e₃×W[2,3] + e₄×W[3,3]

In other words, Result[j] is the dot product of the embedding with column j of the weight matrix.

Try It Yourself!

Calculate the first component of apple's Q₁ vector for Head 1:

Apple embedding: [0.6, 0.4, 1.0, 0.2]

Column 0 of W_Q1: [0.8, 0.1, 0.6, 0.3]

Q₁[0] = 0.6×0.8 + 0.4×0.1 + 1.0×0.6 + 0.2×0.3 = ?

Hint: 0.6×0.8 = 0.48, 0.4×0.1 = 0.04, 1.0×0.6 = 0.6, 0.2×0.3 = 0.06
Sum: 0.48 + 0.04 + 0.6 + 0.06 = 0.81
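To see the rule in action, here is a minimal Python sketch of the vector-times-matrix step (the function name and the one-hot test vector are ours, for illustration; W holds the W_Q1 values from Step 3):

```python
def matvec(e, W):
    """Multiply a 1x4 row vector e by a 4x4 matrix W.
    Result[j] = e[0]*W[0][j] + e[1]*W[1][j] + e[2]*W[2][j] + e[3]*W[3][j]."""
    return [sum(e[i] * W[i][j] for i in range(4)) for j in range(4)]

W = [[0.8, 0.2, 0.1, 0.3],
     [0.1, 0.9, 0.4, 0.2],
     [0.6, 0.4, 0.7, 0.5],
     [0.3, 0.7, 0.2, 0.8]]

# Sanity check with a one-hot vector: it should pick out row 2 of the matrix.
print(matvec([0.0, 0.0, 1.0, 0.0], W))  # [0.6, 0.4, 0.7, 0.5]
```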

๐Ÿ” STEP 3: Head 1 - Semantic Relationships (Query Vectors)

Head 1 focuses on semantic relationships, transforming embeddings into queries to find meaning-based connections.

Weight Matrices for Head 1 (4×4)

W_Q1 (Query Matrix)
Transforms words into semantic "search queries"

0.8  0.2  0.1  0.3
0.1  0.9  0.4  0.2
0.6  0.4  0.7  0.5
0.3  0.7  0.2  0.8

W_K1 (Key Matrix)
Transforms words into semantic "information advertisements"

0.7  0.3  0.2  0.4
0.4  0.6  0.5  0.3
0.9  0.1  0.8  0.2
0.2  0.8  0.3  0.7

W_V1 (Value Matrix)
Transforms words into semantic "information content"

0.5  0.5  0.3  0.2
0.8  0.2  0.6  0.4
0.3  0.7  0.5  0.1
0.6  0.4  0.2  0.8

Apple's Q₁ Vector Calculation

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q1:

Q₁[0] = 0.6×0.8 + 0.4×0.1 + 1.0×0.6 + 0.2×0.3 = 0.48 + 0.04 + 0.6 + 0.06 = 0.81
Q₁[1] = 0.6×0.2 + 0.4×0.9 + 1.0×0.4 + 0.2×0.7 = 0.12 + 0.36 + 0.4 + 0.14 = 0.67
Q₁[2] = 0.6×0.1 + 0.4×0.4 + 1.0×0.7 + 0.2×0.2 = 0.06 + 0.16 + 0.7 + 0.04 = 0.89
Q₁[3] = 0.6×0.3 + 0.4×0.2 + 1.0×0.5 + 0.2×0.8 = 0.18 + 0.08 + 0.5 + 0.16 = 0.73

Apple's Query (Head 1): [0.81, 0.67, 0.89, 0.73]

✅ All Q₁ Vectors Complete

Word    Q₁ Vector (4D)              Meaning
I       [0.95, 0.52, 0.75, 0.62]    "Who am I acting upon?"
bought  [0.82, 0.95, 0.69, 0.98]    "What was purchased?"
apple   [0.81, 0.67, 0.89, 0.73]    "What semantic context defines me?"
to      [0.45, 0.58, 0.47, 0.61]    "What's my purpose?"
eat     [0.87, 0.92, 0.66, 0.94]    "What am I the action for?"

🔑 STEP 4: Head 1 - Semantic Relationships (Key Vectors)

Key vectors in Head 1 advertise what semantic information each word can provide.

Your Turn: Calculate Apple's K₁ Vector!

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K1

K₁[0] = 0.6×0.7 + 0.4×0.4 + 1.0×0.9 + 0.2×0.2 = ?

✅ All K₁ Vectors Complete

Word    K₁ Vector (4D)              Advertisement
I       [1.22, 0.76, 0.81, 0.68]    "I provide agent context!"
bought  [1.15, 0.99, 0.72, 0.93]    "I provide action context!"
apple   [1.61, 0.83, 0.94, 0.78]    "I provide object context!"
to      [0.62, 0.67, 0.49, 0.65]    "I provide purpose context!"
eat     [1.13, 0.96, 0.69, 0.89]    "I provide FOOD context!"

💎 STEP 5: Head 1 - Semantic Relationships (Value Vectors)

Value vectors in Head 1 carry the semantic information each word contributes.

✅ All V₁ Vectors Complete

Word    V₁ Vector (4D)              Information Content
I       [0.72, 0.64, 0.55, 0.47]    Agent information
bought  [0.61, 0.78, 0.49, 0.66]    Purchase information
apple   [0.58, 0.62, 0.52, 0.43]    Object information
to      [0.44, 0.51, 0.38, 0.49]    Purpose information
eat     [0.59, 0.75, 0.46, 0.62]    CONSUMPTION information

🎯 STEP 6: Head 1 - Calculate Attention Scores

We compute dot products between each word's query vector and all key vectors to find semantic relationships. The table below shows the attention scores for all query-key pairs in Head 1 (the query and key vectors are listed in Steps 3 and 4).

Dot Product Formula (4D):
Q₁·K₁ = Q₁[0]×K₁[0] + Q₁[1]×K₁[1] + Q₁[2]×K₁[2] + Q₁[3]×K₁[3]

Attention Scores Matrix (Head 1)

Query \ Key    I      bought  apple   to      eat
I              2.42   2.40    2.37    1.45    2.36
bought         2.81   3.01    2.77    1.76    2.85
apple          2.71   2.92    3.27    1.86    2.82
to             1.57   1.74    1.59    1.76    1.71
eat            2.76   2.96    2.72    1.74    2.87

Apple's Attention Scores (Head 1)

Apple Query: [0.81, 0.67, 0.89, 0.73]

Apple → I:
[0.81, 0.67, 0.89, 0.73] · [1.22, 0.76, 0.81, 0.68] = 0.81×1.22 + 0.67×0.76 + 0.89×0.81 + 0.73×0.68
= 0.988 + 0.509 + 0.721 + 0.496 = 2.714

Apple → bought:
= 0.81×1.15 + 0.67×0.99 + 0.89×0.72 + 0.73×0.93 = 0.932 + 0.663 + 0.641 + 0.679 = 2.915

Apple → apple:
= 0.81×1.61 + 0.67×0.83 + 0.89×0.94 + 0.73×0.78 = 1.304 + 0.556 + 0.837 + 0.569 = 3.266 ⭐ HIGHEST!

Apple → to:
= 0.81×0.62 + 0.67×0.67 + 0.89×0.49 + 0.73×0.65 = 0.502 + 0.449 + 0.436 + 0.475 = 1.862

Apple → eat:
= 0.81×1.13 + 0.67×0.96 + 0.89×0.69 + 0.73×0.89 = 0.915 + 0.643 + 0.614 + 0.650 = 2.822

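Here are the same dot products as a short numpy sketch (variable names ours), using apple's Q₁ and the K₁ vectors from the tables above:

```python
import numpy as np

q_apple = np.array([0.81, 0.67, 0.89, 0.73])  # apple's Head 1 query
K1 = np.array([
    [1.22, 0.76, 0.81, 0.68],  # I
    [1.15, 0.99, 0.72, 0.93],  # bought
    [1.61, 0.83, 0.94, 0.78],  # apple
    [0.62, 0.67, 0.49, 0.65],  # to
    [1.13, 0.96, 0.69, 0.89],  # eat
])
scores = K1 @ q_apple  # one dot product per key
print(scores.round(2))  # [2.71 2.91 3.27 1.86 2.82]
```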
๐Ÿ“ STEP 7: Head 1 - Scale and Softmax

We scale attention scores by โˆš4 = 2 and apply softmax to get attention weights that sum to 1.0 for each query word. The table below shows the attention weights for all query-key pairs in Head 1, with the highest weight for each query highlighted.
Scaling and Softmax:
Scaled Score = Raw Score รท โˆš4 = Raw Score รท 2
Softmax: e^(Scaled Score) รท Sum(e^(Scaled Scores))
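In matrix form, Steps 6 and 7 together are the standard scaled dot-product attention formula: Attention(Q, K, V) = softmax(QKᵀ ÷ √d_k) × V, where d_k = 4 is the dimension of the key vectors.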

Attention Weights Matrix (Head 1)

Query \ Key    I       bought  apple   to      eat
I              0.208   0.204   0.199   0.080   0.197
bought         0.202   0.245   0.194   0.070   0.209
apple          0.151   0.184   0.162   0.121   0.382
to             0.170   0.202   0.175   0.206   0.196
eat            0.201   0.243   0.191   0.072   0.223

Practice Softmax: Head 1, Apple → eat

Raw score: 2.822, Scaled: 2.822 ÷ 2 = 1.411

Exponentials: [e^1.357, e^1.458, e^1.633, e^0.931, e^1.411] = [3.885, 4.301, 5.121, 2.537, 4.101]

Sum = 19.945, Softmax = 4.101 ÷ 19.945 = ?
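A minimal Python sketch of the scale-and-softmax step (function name ours), fed with apple's Head 1 raw scores from Step 6:

```python
import numpy as np

def attention_weights(raw_scores, d_k=4):
    """Divide raw scores by sqrt(d_k), then softmax so the weights sum to 1."""
    scaled = np.asarray(raw_scores) / np.sqrt(d_k)
    exp = np.exp(scaled)
    return exp / exp.sum()

w = attention_weights([2.714, 2.915, 3.266, 1.862, 2.822])  # I, bought, apple, to, eat
print(w.sum())  # ~1.0: softmax always normalizes the weights
```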

Apple's Attention Weights (Head 1)

Word    Weight
I       0.151
bought  0.184
apple   0.162
to      0.121
eat     0.382 ⭐

๐Ÿ“ STEP 8: Head 1 - Output Vector

We compute the weighted sum of value vectors using attention weights to get Head 1's output for "apple".

Apple's Output (Head 1)

Weighted Value Vectors:
0.151 ร— [0.72, 0.64, 0.55, 0.47] = [0.109, 0.097, 0.083, 0.071]
0.184 ร— [0.61, 0.78, 0.49, 0.66] = [0.112, 0.143, 0.090, 0.121]
0.162 ร— [0.58, 0.62, 0.52, 0.43] = [0.094, 0.100, 0.084, 0.070]
0.121 ร— [0.44, 0.51, 0.38, 0.49] = [0.053, 0.062, 0.046, 0.059]
0.382 ร— [0.59, 0.75, 0.46, 0.62] = [0.225, 0.287, 0.176, 0.237]

Sum:
Component 1: 0.109 + 0.112 + 0.094 + 0.053 + 0.225 = 0.58
Component 2: 0.097 + 0.143 + 0.100 + 0.062 + 0.287 = 0.46
Component 3: 0.083 + 0.090 + 0.084 + 0.046 + 0.176 = 0.37
Component 4: 0.071 + 0.121 + 0.070 + 0.059 + 0.237 = 0.46

Output: [0.58, 0.46, 0.37, 0.46]
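The same computation as a short numpy sketch, using the weights and value vectors from the tables above:

```python
import numpy as np

weights = np.array([0.151, 0.184, 0.162, 0.121, 0.382])  # I, bought, apple, to, eat
V1 = np.array([
    [0.72, 0.64, 0.55, 0.47],  # I
    [0.61, 0.78, 0.49, 0.66],  # bought
    [0.58, 0.62, 0.52, 0.43],  # apple
    [0.44, 0.51, 0.38, 0.49],  # to
    [0.59, 0.75, 0.46, 0.62],  # eat
])
output = weights @ V1  # weighted sum of the value vectors
print(output.round(2))  # [0.59 0.69 0.48 0.56]
```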

๐Ÿ“ STEP 9: Heads 2 & 3 - Positional/Syntactic and Contextual Purpose

Heads 2 and 3 follow the same process as Head 1 but focus on syntactic structure and contextual purpose, respectively. Let's compute the full process for each head.

๐Ÿ” STEP 9.1: Head 2 - Positional/Syntactic (Query Vectors)

Head 2 focuses on positional and syntactic relationships, transforming embeddings into queries to find structural connections in the sentence.

Weight Matrices for Head 2 (4×4)

W_Q2 (Query Matrix)
Transforms words into syntactic "search queries"

0.6  0.3  0.2  0.4
0.2  0.7  0.5  0.1
0.8  0.4  0.6  0.3
0.5  0.1  0.3  0.9

W_K2 (Key Matrix)
Transforms words into syntactic "information advertisements"

0.5  0.2  0.3  0.6
0.3  0.8  0.4  0.2
0.7  0.5  0.9  0.1
0.4  0.6  0.2  0.8

W_V2 (Value Matrix)
Transforms words into syntactic "information content"

0.4  0.6  0.5  0.3
0.7  0.3  0.2  0.5
0.2  0.8  0.4  0.6
0.5  0.4  0.7  0.2

Apple's Q₂ Vector Calculation

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q2:

Q₂[0] = 0.6×0.6 + 0.4×0.2 + 1.0×0.8 + 0.2×0.5 = 0.36 + 0.08 + 0.8 + 0.1 = 1.34
Q₂[1] = 0.6×0.3 + 0.4×0.7 + 1.0×0.4 + 0.2×0.1 = 0.18 + 0.28 + 0.4 + 0.02 = 0.88
Q₂[2] = 0.6×0.2 + 0.4×0.5 + 1.0×0.6 + 0.2×0.3 = 0.12 + 0.2 + 0.6 + 0.06 = 0.98
Q₂[3] = 0.6×0.4 + 0.4×0.1 + 1.0×0.3 + 0.2×0.9 = 0.24 + 0.04 + 0.3 + 0.18 = 0.76

Apple's Query (Head 2): [1.34, 0.88, 0.98, 0.76]

✅ All Q₂ Vectors Complete

Word    Q₂ Vector (4D)              Meaning
I       [0.93, 0.51, 0.67, 0.58]    "What is my syntactic role?"
bought  [1.01, 0.96, 0.83, 0.84]    "What action do I govern?"
apple   [1.34, 0.88, 0.98, 0.76]    "What is my syntactic position?"
to      [0.53, 0.62, 0.51, 0.65]    "What follows me syntactically?"
eat     [0.95, 0.89, 0.71, 0.79]    "What action am I linked to?"

🔑 STEP 9.2: Head 2 - Positional/Syntactic (Key Vectors)

Key vectors in Head 2 advertise what syntactic information each word can provide.

Your Turn: Calculate Apple's K₂ Vector!

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K2

K₂[0] = 0.6×0.5 + 0.4×0.3 + 1.0×0.7 + 0.2×0.4 = ?

✅ All K₂ Vectors Complete

Word    K₂ Vector (4D)              Advertisement
I       [0.88, 0.62, 0.79, 0.55]    "I provide subject context!"
bought  [1.25, 0.94, 0.88, 0.72]    "I provide verb context!"
apple   [1.20, 0.79, 0.95, 0.68]    "I provide object context!"
to      [0.59, 0.65, 0.52, 0.62]    "I provide preposition context!"
eat     [0.92, 0.81, 0.74, 0.66]    "I provide verb context!"

💎 STEP 9.3: Head 2 - Positional/Syntactic (Value Vectors)

Value vectors in Head 2 carry the syntactic information each word contributes.

✅ All V₂ Vectors Complete

Word    V₂ Vector (4D)              Information Content
I       [0.65, 0.51, 0.42, 0.38]    Subject information
bought  [0.72, 0.49, 0.55, 0.41]    Verb information
apple   [0.68, 0.47, 0.53, 0.39]    Object information
to      [0.51, 0.42, 0.36, 0.45]    Preposition information
eat     [0.61, 0.46, 0.49, 0.37]    Verb information

🎯 STEP 9.4: Head 2 - Calculate Attention Scores

We compute dot products between each word's query vector and all key vectors to find syntactic relationships. The table below shows the attention scores for all query-key pairs in Head 2; apple's row matches the worked calculations that follow.

Attention Scores Matrix (Head 2)

Query \ Key    I      bought  apple   to      eat
I              1.68   1.64    1.61    1.23    1.59
bought         1.92   2.05    1.97    1.49    1.90
apple          2.92   3.91    3.75    2.34    3.17
to             1.28   1.39    1.33    1.41    1.36
eat            1.83   1.97    1.89    1.46    1.92

Apple's Attention Scores (Head 2)

Apple Query: [1.34, 0.88, 0.98, 0.76]

Apple → I:
[1.34, 0.88, 0.98, 0.76] · [0.88, 0.62, 0.79, 0.55] = 1.34×0.88 + 0.88×0.62 + 0.98×0.79 + 0.76×0.55
= 1.179 + 0.546 + 0.774 + 0.418 = 2.917

Apple → bought:
= 1.34×1.25 + 0.88×0.94 + 0.98×0.88 + 0.76×0.72 = 1.675 + 0.827 + 0.862 + 0.547 = 3.911 ⭐ HIGHEST!

Apple → apple:
= 1.34×1.20 + 0.88×0.79 + 0.98×0.95 + 0.76×0.68 = 1.608 + 0.695 + 0.931 + 0.517 = 3.751

Apple → to:
= 1.34×0.59 + 0.88×0.65 + 0.98×0.52 + 0.76×0.62 = 0.791 + 0.572 + 0.510 + 0.471 = 2.344

Apple → eat:
= 1.34×0.92 + 0.88×0.81 + 0.98×0.74 + 0.76×0.66 = 1.233 + 0.712 + 0.725 + 0.502 = 3.172

๐Ÿ“ STEP 9.5: Head 2 - Scale and Softmax

We scale attention scores by โˆš4 = 2 and apply softmax to get attention weights that sum to 1.0 for each query word.

Attention Weights Matrix (Head 2)

Query \ Key I bought apple to eat
I 0.269 0.258 0.250 0.172 0.246
bought 0.223 0.315 0.234 0.146 0.217
apple 0.269 0.315 0.221 0.147 0.048
to 0.194 0.216 0.204 0.221 0.210
eat 0.221 0.256 0.236 0.153 0.245

Practice Softmax: Head 2, Apple → bought

Raw score: 3.911, Scaled: 3.911 ÷ 2 = 1.956

Exponentials: [e^1.459, e^1.956, e^1.876, e^1.172, e^1.586] = [4.304, 7.071, 6.529, 3.229, 4.885]

Sum = 26.018, Softmax = 7.071 ÷ 26.018 = ?

Apple's Attention Weights (Head 2)

Word    Weight
I       0.165
bought  0.272 ⭐
apple   0.247
to      0.116
eat     0.200

๐Ÿ“ STEP 9.6: Head 2 - Output Vector

We compute the weighted sum of value vectors using attention weights to get Head 2's output for "apple".

Apple's Output (Head 2)

Weighted Value Vectors:
0.165 ร— [0.65, 0.51, 0.42, 0.38] = [0.107, 0.084, 0.069, 0.063]
0.272 ร— [0.72, 0.49, 0.55, 0.41] = [0.196, 0.133, 0.150, 0.112]
0.247 ร— [0.68, 0.47, 0.53, 0.39] = [0.168, 0.116, 0.131, 0.096]
0.116 ร— [0.51, 0.42, 0.36, 0.45] = [0.059, 0.049, 0.042, 0.052]
0.200 ร— [0.61, 0.46, 0.49, 0.37] = [0.122, 0.092, 0.098, 0.074]

Sum:
Component 1: 0.107 + 0.196 + 0.168 + 0.059 + 0.122 = 0.61
Component 2: 0.084 + 0.133 + 0.116 + 0.049 + 0.092 = 0.42
Component 3: 0.069 + 0.150 + 0.131 + 0.042 + 0.098 = 0.47
Component 4: 0.063 + 0.112 + 0.096 + 0.052 + 0.074 = 0.32

Output: [0.61, 0.42, 0.47, 0.32]
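As with Head 1, the whole step is one matrix product in numpy (weights and value vectors from the Head 2 tables above):

```python
import numpy as np

weights2 = np.array([0.165, 0.272, 0.247, 0.116, 0.200])  # I, bought, apple, to, eat
V2 = np.array([
    [0.65, 0.51, 0.42, 0.38],  # I
    [0.72, 0.49, 0.55, 0.41],  # bought
    [0.68, 0.47, 0.53, 0.39],  # apple
    [0.51, 0.42, 0.36, 0.45],  # to
    [0.61, 0.46, 0.49, 0.37],  # eat
])
print((weights2 @ V2).round(2))  # [0.65 0.47 0.49 0.4 ]
```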

๐Ÿ” STEP 9.7: Head 3 - Contextual Purpose (Query Vectors)

Head 3 focuses on contextual purpose, transforming embeddings into queries to find purpose-driven connections in the sentence.

Weight Matrices for Head 3 (4×4)

W_Q3 (Query Matrix)
Transforms words into purpose-driven "search queries"

0.7  0.4  0.3  0.5
0.3  0.8  0.6  0.2
0.9  0.5  0.7  0.4
0.6  0.2  0.4  0.8

W_K3 (Key Matrix)
Transforms words into purpose-driven "information advertisements"

0.6  0.3  0.4  0.5
0.4  0.7  0.5  0.3
0.8  0.6  0.8  0.2
0.5  0.7  0.3  0.9

W_V3 (Value Matrix)
Transforms words into purpose-driven "information content"

0.5  0.7  0.4  0.3
0.8  0.4  0.5  0.6
0.3  0.9  0.5  0.7
0.6  0.5  0.6  0.4

Apple's Q₃ Vector Calculation

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q3:

Q₃[0] = 0.6×0.7 + 0.4×0.3 + 1.0×0.9 + 0.2×0.6 = 0.42 + 0.12 + 0.9 + 0.12 = 1.57
Q₃[1] = 0.6×0.4 + 0.4×0.8 + 1.0×0.5 + 0.2×0.2 = 0.24 + 0.32 + 0.5 + 0.04 = 1.10
Q₃[2] = 0.6×0.3 + 0.4×0.6 + 1.0×0.7 + 0.2×0.4 = 0.18 + 0.24 + 0.7 + 0.08 = 1.20
Q₃[3] = 0.6×0.5 + 0.4×0.2 + 1.0×0.4 + 0.2×0.8 = 0.3 + 0.08 + 0.4 + 0.16 = 0.94

Apple's Query (Head 3): [1.57, 1.10, 1.20, 0.94]

✅ All Q₃ Vectors Complete

Word    Q₃ Vector (4D)              Meaning
I       [1.05, 0.63, 0.79, 0.67]    "What is my purpose?"
bought  [1.13, 0.99, 0.87, 0.89]    "What is the purpose of my action?"
apple   [1.57, 1.10, 1.20, 0.94]    "What is my contextual purpose?"
to      [0.61, 0.67, 0.56, 0.69]    "What purpose do I serve?"
eat     [1.01, 0.94, 0.76, 0.84]    "What is my action's purpose?"

🔑 STEP 9.8: Head 3 - Contextual Purpose (Key Vectors)

Key vectors in Head 3 advertise what purpose-driven information each word can provide.

Your Turn: Calculate Apple's K₃ Vector!

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K3

K₃[0] = 0.6×0.6 + 0.4×0.4 + 1.0×0.8 + 0.2×0.5 = ?

✅ All K₃ Vectors Complete

Word    K₃ Vector (4D)              Advertisement
I       [0.99, 0.71, 0.84, 0.62]    "I provide agent purpose!"
bought  [1.30, 0.98, 0.92, 0.77]    "I provide action purpose!"
apple   [1.42, 0.84, 0.99, 0.73]    "I provide object purpose!"
to      [0.64, 0.69, 0.57, 0.67]    "I provide preposition purpose!"
eat     [0.97, 0.86, 0.79, 0.71]    "I provide action purpose!"

💎 STEP 9.9: Head 3 - Contextual Purpose (Value Vectors)

Value vectors in Head 3 carry the purpose-driven information each word contributes.

✅ All V₃ Vectors Complete

Word    V₃ Vector (4D)              Information Content
I       [0.67, 0.53, 0.44, 0.39]    Agent purpose
bought  [0.74, 0.51, 0.57, 0.43]    Action purpose
apple   [0.70, 0.49, 0.55, 0.41]    Object purpose
to      [0.53, 0.44, 0.38, 0.47]    Preposition purpose
eat     [0.63, 0.48, 0.51, 0.39]    Action purpose

🎯 STEP 9.10: Head 3 - Calculate Attention Scores

We compute dot products between each word's query vector and all key vectors to find purpose-driven relationships; apple's row matches the worked calculations that follow.

Attention Scores Matrix (Head 3)

Query \ Key    I      bought  apple   to      eat
I              1.95   1.91    1.88    1.31    1.85
bought         2.15   2.28    2.20    1.47    2.13
apple          3.93   4.95    5.03    3.08    4.08
to             1.49   1.60    1.54    1.62    1.57
eat            2.07   2.20    2.12    1.44    2.09

Apple's Attention Scores (Head 3)

Apple Query: [1.57, 1.10, 1.20, 0.94]

Apple → I:
[1.57, 1.10, 1.20, 0.94] · [0.99, 0.71, 0.84, 0.62] = 1.57×0.99 + 1.10×0.71 + 1.20×0.84 + 0.94×0.62
= 1.554 + 0.781 + 1.008 + 0.583 = 3.926

Apple → bought:
= 1.57×1.30 + 1.10×0.98 + 1.20×0.92 + 0.94×0.77 = 2.041 + 1.078 + 1.104 + 0.724 = 4.947

Apple → apple:
= 1.57×1.42 + 1.10×0.84 + 1.20×0.99 + 0.94×0.73 = 2.229 + 0.924 + 1.188 + 0.686 = 5.027 ⭐ HIGHEST!

Apple → to:
= 1.57×0.64 + 1.10×0.69 + 1.20×0.57 + 0.94×0.67 = 1.005 + 0.759 + 0.684 + 0.630 = 3.078

Apple → eat:
= 1.57×0.97 + 1.10×0.86 + 1.20×0.79 + 0.94×0.71 = 1.523 + 0.946 + 0.948 + 0.667 = 4.084

๐Ÿ“ STEP 9.11: Head 3 - Scale and Softmax

We scale attention scores by โˆš4 = 2 and apply softmax to get attention weights that sum to 1.0 for each query word.

Attention Weights Matrix (Head 3)

Query \ Key I bought apple to eat
I 0.241 0.232 0.225 0.127 0.218
bought 0.228 0.260 0.239 0.115 0.223
apple 0.141 0.351 0.151 0.104 0.253
to 0.187 0.209 0.197 0.213 0.203
eat 0.225 0.257 0.237 0.119 0.229

Practice Softmax: Head 3, Apple → eat

Raw score: 4.084, Scaled: 4.084 ÷ 2 = 2.042

Exponentials: [e^1.963, e^2.474, e^2.514, e^1.539, e^2.042] = [7.125, 11.870, 12.353, 4.657, 7.706]

Sum = 43.711, Softmax = 7.706 ÷ 43.711 = ?

Apple's Attention Weights (Head 3)

Word    Weight
I       0.163
bought  0.204
apple   0.177
to      0.121
eat     0.335 ⭐

๐Ÿ“ STEP 9.12: Head 3 - Output Vector

We compute the weighted sum of value vectors using attention weights to get Head 3's output for "apple".

Apple's Output (Head 3)

Weighted Value Vectors:
0.163 ร— [0.67, 0.53, 0.44, 0.39] = [0.109, 0.086, 0.072, 0.064]
0.204 ร— [0.74, 0.51, 0.57, 0.43] = [0.151, 0.104, 0.116, 0.088]
0.177 ร— [0.70, 0.49, 0.55, 0.41] = [0.124, 0.087, 0.097, 0.073]
0.121 ร— [0.53, 0.44, 0.38, 0.47] = [0.064, 0.053, 0.046, 0.057]

0.335 ร— [0.63, 0.48, 0.51, 0.39] = [0.211, 0.161, 0.171, 0.131]

Sum:
Component 1: 0.109 + 0.151 + 0.124 + 0.064 + 0.211 = 0.66
Component 2: 0.086 + 0.104 + 0.087 + 0.053 + 0.161 = 0.49
Component 3: 0.072 + 0.116 + 0.097 + 0.046 + 0.171 = 0.50
Component 4: 0.064 + 0.088 + 0.073 + 0.057 + 0.131 = 0.41

Output: [0.66, 0.49, 0.50, 0.41]

🔗 STEP 10: Concatenate Head Outputs

We combine the outputs from all three heads for "apple" to form a 12D vector, capturing semantic, syntactic, and purposive insights.

Apple's Multi-Head Output

Head 1 (Semantic):  [0.59, 0.69, 0.48, 0.56]
Head 2 (Syntactic): [0.65, 0.47, 0.49, 0.40]
Head 3 (Purpose):   [0.66, 0.49, 0.50, 0.41]

Concatenated: [0.59, 0.69, 0.48, 0.56, 0.65, 0.47, 0.49, 0.40, 0.66, 0.49, 0.50, 0.41]
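Concatenation is a purely mechanical step; a short numpy sketch using the three head outputs above:

```python
import numpy as np

head1 = np.array([0.59, 0.69, 0.48, 0.56])  # semantic
head2 = np.array([0.65, 0.47, 0.49, 0.40])  # syntactic
head3 = np.array([0.66, 0.49, 0.50, 0.41])  # purpose

multi_head = np.concatenate([head1, head2, head3])
print(multi_head.shape)  # (12,): three 4-D head outputs side by side
```

In a full transformer, this concatenated vector would additionally be multiplied by an output projection matrix (often written W_O) to map it back to the model dimension; that step is outside the scope of this walkthrough.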

🎉 STEP 11: Understanding the Result

The 12D vector represents "apple" with enriched context: Head 1 contributes semantic (consumption-related) information, Head 2 contributes syntactic (object-of-verb) information, and Head 3 contributes purposive information. This multi-faceted representation enables the transformer to understand "apple" in context!

โ“ Quiz Time!

Test Your Knowledge

Q1: Why does Head 1 give high attention to "eat" for "apple"?

a) "Eat" is the subject
b) "Eat" provides semantic context as consumption
c) "Eat" is syntactically closest

Q2: What does the softmax step ensure?

a) Scores are scaled
b) Weights sum to 1.0
c) Vectors are 4D