🧠 Interactive Multi-Head Attention Walkthrough

Dear student, you're about to explore the power of multi-head attention in transformers! We'll break down how the word "apple" gains a rich understanding through three specialized attention heads, each focusing on a different kind of relationship. Follow the steps, try the exercises, and see the magic of parallel processing!

Example sentence: "I bought apple to eat"

📋 STEP 1: Understanding the Input Data

Your task: We'll compute how "apple" uses three attention heads to understand its context as a fruit, focusing on semantic, syntactic, and purposive relationships.

Word Embeddings (4-dimensional vectors)

These vectors represent each word's initial meaning in a 4D space, shared across all attention heads.

Position  Word    Embedding Vector [e₁, e₂, e₃, e₄]
1         I       [1.0, 0.2, 0.5, 0.3]
2         bought  [0.7, 0.9, 0.4, 0.8]
3         apple   [0.6, 0.4, 1.0, 0.2]  ← YOUR FOCUS WORD
4         to      [0.2, 0.5, 0.3, 0.6]
5         eat     [0.8, 1.0, 0.3, 0.6]

🧮 STEP 2: Matrix Multiplication Rules

How to multiply a 4D vector with a 4×4 matrix:
If embedding = [e₁, e₂, e₃, e₄] and Weight = 4×4 matrix

Result[0] = e₁×W[0,0] + e₂×W[1,0] + e₃×W[2,0] + e₄×W[3,0]
Result[1] = e₁×W[0,1] + e₂×W[1,1] + e₃×W[2,1] + e₄×W[3,1]
Result[2] = e₁×W[0,2] + e₂×W[1,2] + e₃×W[2,2] + e₄×W[3,2]
Result[3] = e₁×W[0,3] + e₂×W[1,3] + e₃×W[2,3] + e₄×W[3,3]

In other words, Result[j] is the dot product of the embedding with column j of the weight matrix.

Try It Yourself!

Calculate the first component of apple's Q₁ vector for Head 1:

Apple embedding: [0.6, 0.4, 1.0, 0.2]

Column 0 of W_Q1: [0.8, 0.1, 0.6, 0.3]

Q₁[0] = 0.6×0.8 + 0.4×0.1 + 1.0×0.6 + 0.2×0.3 = ?

Hint: 0.6×0.8 = 0.48, 0.4×0.1 = 0.04, 1.0×0.6 = 0.6, 0.2×0.3 = 0.06
Sum: 0.48 + 0.04 + 0.6 + 0.06 = 0.81
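To see the rule in action, here is a minimal Python sketch of the vector-times-matrix step (the function name and the one-hot test vector are ours, for illustration; W holds the W_Q1 values from Step 3):

```python
def matvec(e, W):
    """Multiply a 1x4 row vector e by a 4x4 matrix W.
    Result[j] = e[0]*W[0][j] + e[1]*W[1][j] + e[2]*W[2][j] + e[3]*W[3][j]."""
    return [sum(e[i] * W[i][j] for i in range(4)) for j in range(4)]

W = [[0.8, 0.2, 0.1, 0.3],
     [0.1, 0.9, 0.4, 0.2],
     [0.6, 0.4, 0.7, 0.5],
     [0.3, 0.7, 0.2, 0.8]]

# Sanity check with a one-hot vector: it should pick out row 2 of the matrix.
print(matvec([0.0, 0.0, 1.0, 0.0], W))  # [0.6, 0.4, 0.7, 0.5]
```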

๐Ÿ” STEP 3: Head 1 - Semantic Relationships (Query Vectors)

Head 1 focuses on semantic relationships, transforming embeddings into queries to find meaning-based connections.

Weight Matrices for Head 1 (4×4)

W_Q1 (Query Matrix)
Transforms words into semantic "search queries"

0.8  0.2  0.1  0.3
0.1  0.9  0.4  0.2
0.6  0.4  0.7  0.5
0.3  0.7  0.2  0.8

W_K1 (Key Matrix)
Transforms words into semantic "information advertisements"

0.7  0.3  0.2  0.4
0.4  0.6  0.5  0.3
0.9  0.1  0.8  0.2
0.2  0.8  0.3  0.7

W_V1 (Value Matrix)
Transforms words into semantic "information content"

0.5  0.5  0.3  0.2
0.8  0.2  0.6  0.4
0.3  0.7  0.5  0.1
0.6  0.4  0.2  0.8

Apple's Q₁ Vector Calculation

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q1:

Q₁[0] = 0.6×0.8 + 0.4×0.1 + 1.0×0.6 + 0.2×0.3 = 0.48 + 0.04 + 0.6 + 0.06 = 0.81
Q₁[1] = 0.6×0.2 + 0.4×0.9 + 1.0×0.4 + 0.2×0.7 = 0.12 + 0.36 + 0.4 + 0.14 = 0.67
Q₁[2] = 0.6×0.1 + 0.4×0.4 + 1.0×0.7 + 0.2×0.2 = 0.06 + 0.16 + 0.7 + 0.04 = 0.89
Q₁[3] = 0.6×0.3 + 0.4×0.2 + 1.0×0.5 + 0.2×0.8 = 0.18 + 0.08 + 0.5 + 0.16 = 0.73

Apple's Query (Head 1): [0.81, 0.67, 0.89, 0.73]

✅ All Q₁ Vectors Complete

Word    Q₁ Vector (4D)              Meaning
I       [0.95, 0.52, 0.75, 0.62]    "Who am I acting upon?"
bought  [0.82, 0.95, 0.69, 0.98]    "What was purchased?"
apple   [0.81, 0.67, 0.89, 0.73]    "What semantic context defines me?"
to      [0.45, 0.58, 0.47, 0.61]    "What's my purpose?"
eat     [0.87, 0.92, 0.66, 0.94]    "What am I the action for?"

🔑 STEP 4: Head 1 - Semantic Relationships (Key Vectors)

Key vectors in Head 1 advertise what semantic information each word can provide.

Your Turn: Calculate Apple's K₁ Vector!

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K1

K₁[0] = 0.6×0.7 + 0.4×0.4 + 1.0×0.9 + 0.2×0.2 = ?

✅ All K₁ Vectors Complete

Word    K₁ Vector (4D)              Advertisement
I       [1.22, 0.76, 0.81, 0.68]    "I provide agent context!"
bought  [1.15, 0.99, 0.72, 0.93]    "I provide action context!"
apple   [1.61, 0.83, 0.94, 0.78]    "I provide object context!"
to      [0.62, 0.67, 0.49, 0.65]    "I provide purpose context!"
eat     [1.13, 0.96, 0.69, 0.89]    "I provide FOOD context!"

💎 STEP 5: Head 1 - Semantic Relationships (Value Vectors)

Value vectors in Head 1 carry the semantic information each word contributes.

✅ All V₁ Vectors Complete

Word    V₁ Vector (4D)              Information Content
I       [0.72, 0.64, 0.55, 0.47]    Agent information
bought  [0.61, 0.78, 0.49, 0.66]    Purchase information
apple   [0.58, 0.62, 0.52, 0.43]    Object information
to      [0.44, 0.51, 0.38, 0.49]    Purpose information
eat     [0.59, 0.75, 0.46, 0.62]    CONSUMPTION information

🎯 STEP 6: Head 1 - Calculate Attention Scores

We compute dot products between each word's query vector and all key vectors to find semantic relationships. The table below shows the attention scores for all query-key pairs in Head 1 (the query and key vectors are listed in Steps 3 and 4).

Dot Product Formula (4D):
Q₁·K₁ = Q₁[0]×K₁[0] + Q₁[1]×K₁[1] + Q₁[2]×K₁[2] + Q₁[3]×K₁[3]

Attention Scores Matrix (Head 1)

Query \ Key    I      bought  apple   to      eat
I              2.42   2.40    2.37    1.45    2.36
bought         2.81   3.01    2.77    1.76    2.85
apple          2.71   2.92    3.27    1.86    2.82
to             1.57   1.74    1.59    1.76    1.71
eat            2.76   2.96    2.72    1.74    2.87

Apple's Attention Scores (Head 1)

Apple Query: [0.81, 0.67, 0.89, 0.73]

Apple → I:
[0.81, 0.67, 0.89, 0.73] · [1.22, 0.76, 0.81, 0.68] = 0.81×1.22 + 0.67×0.76 + 0.89×0.81 + 0.73×0.68
= 0.988 + 0.509 + 0.721 + 0.496 = 2.714

Apple → bought:
= 0.81×1.15 + 0.67×0.99 + 0.89×0.72 + 0.73×0.93 = 0.932 + 0.663 + 0.641 + 0.679 = 2.915

Apple → apple:
= 0.81×1.61 + 0.67×0.83 + 0.89×0.94 + 0.73×0.78 = 1.304 + 0.556 + 0.837 + 0.569 = 3.266 ⭐ HIGHEST!

Apple → to:
= 0.81×0.62 + 0.67×0.67 + 0.89×0.49 + 0.73×0.65 = 0.502 + 0.449 + 0.436 + 0.475 = 1.862

Apple → eat:
= 0.81×1.13 + 0.67×0.96 + 0.89×0.69 + 0.73×0.89 = 0.915 + 0.643 + 0.614 + 0.650 = 2.822

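Here are the same dot products as a short numpy sketch (variable names ours), using apple's Q₁ and the K₁ vectors from the tables above:

```python
import numpy as np

q_apple = np.array([0.81, 0.67, 0.89, 0.73])  # apple's Head 1 query
K1 = np.array([
    [1.22, 0.76, 0.81, 0.68],  # I
    [1.15, 0.99, 0.72, 0.93],  # bought
    [1.61, 0.83, 0.94, 0.78],  # apple
    [0.62, 0.67, 0.49, 0.65],  # to
    [1.13, 0.96, 0.69, 0.89],  # eat
])
scores = K1 @ q_apple  # one dot product per key
print(scores.round(2))  # [2.71 2.91 3.27 1.86 2.82]
```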
๐Ÿ“ STEP 7: Head 1 - Scale and Softmax

We scale attention scores by โˆš4 = 2 and apply softmax to get attention weights that sum to 1.0 for each query word. The table below shows the attention weights for all query-key pairs in Head 1, with the highest weight for each query highlighted.
Scaling and Softmax:
Scaled Score = Raw Score รท โˆš4 = Raw Score รท 2
Softmax: e^(Scaled Score) รท Sum(e^(Scaled Scores))
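In matrix form, Steps 6 and 7 together are the standard scaled dot-product attention formula: Attention(Q, K, V) = softmax(QKᵀ ÷ √d_k) × V, where d_k = 4 is the dimension of the key vectors.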

Attention Weights Matrix (Head 1)

Query \ Key    I       bought  apple   to      eat
I              0.208   0.204   0.199   0.080   0.197
bought         0.202   0.245   0.194   0.070   0.209
apple          0.151   0.184   0.162   0.121   0.382
to             0.170   0.202   0.175   0.206   0.196
eat            0.201   0.243   0.191   0.072   0.223

Practice Softmax: Head 1, Apple → eat

Raw score: 2.822, Scaled: 2.822 ÷ 2 = 1.411

Exponentials: [e^1.357, e^1.458, e^1.633, e^0.931, e^1.411] = [3.885, 4.301, 5.121, 2.537, 4.101]

Sum = 19.945, Softmax = 4.101 ÷ 19.945 = ?
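A minimal Python sketch of the scale-and-softmax step (function name ours), fed with apple's Head 1 raw scores from Step 6:

```python
import numpy as np

def attention_weights(raw_scores, d_k=4):
    """Divide raw scores by sqrt(d_k), then softmax so the weights sum to 1."""
    scaled = np.asarray(raw_scores) / np.sqrt(d_k)
    exp = np.exp(scaled)
    return exp / exp.sum()

w = attention_weights([2.714, 2.915, 3.266, 1.862, 2.822])  # I, bought, apple, to, eat
print(w.sum())  # ~1.0: softmax always normalizes the weights
```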

Apple's Attention Weights (Head 1)

Word    Weight
I       0.151
bought  0.184
apple   0.162
to      0.121
eat     0.382 ⭐

๐Ÿ“ STEP 8: Head 1 - Output Vector

We compute the weighted sum of value vectors using attention weights to get Head 1's output for "apple".

Apple's Output (Head 1)

Weighted Value Vectors:
0.151 ร— [0.72, 0.64, 0.55, 0.47] = [0.109, 0.097, 0.083, 0.071]
0.184 ร— [0.61, 0.78, 0.49, 0.66] = [0.112, 0.143, 0.090, 0.121]
0.162 ร— [0.58, 0.62, 0.52, 0.43] = [0.094, 0.100, 0.084, 0.070]
0.121 ร— [0.44, 0.51, 0.38, 0.49] = [0.053, 0.062, 0.046, 0.059]
0.382 ร— [0.59, 0.75, 0.46, 0.62] = [0.225, 0.287, 0.176, 0.237]

Sum:
Component 1: 0.109 + 0.112 + 0.094 + 0.053 + 0.225 = 0.58
Component 2: 0.097 + 0.143 + 0.100 + 0.062 + 0.287 = 0.46
Component 3: 0.083 + 0.090 + 0.084 + 0.046 + 0.176 = 0.37
Component 4: 0.071 + 0.121 + 0.070 + 0.059 + 0.237 = 0.46

Output: [0.58, 0.46, 0.37, 0.46]
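The same computation as a short numpy sketch, using the weights and value vectors from the tables above:

```python
import numpy as np

weights = np.array([0.151, 0.184, 0.162, 0.121, 0.382])  # I, bought, apple, to, eat
V1 = np.array([
    [0.72, 0.64, 0.55, 0.47],  # I
    [0.61, 0.78, 0.49, 0.66],  # bought
    [0.58, 0.62, 0.52, 0.43],  # apple
    [0.44, 0.51, 0.38, 0.49],  # to
    [0.59, 0.75, 0.46, 0.62],  # eat
])
output = weights @ V1  # weighted sum of the value vectors
print(output.round(2))  # [0.59 0.69 0.48 0.56]
```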

๐Ÿ“ STEP 9: Heads 2 & 3 - Positional/Syntactic and Contextual Purpose

Heads 2 and 3 follow the same process as Head 1 but focus on syntactic structure and contextual purpose, respectively. Let's compute the full process for each head.

๐Ÿ” STEP 9.1: Head 2 - Positional/Syntactic (Query Vectors)

Head 2 focuses on positional and syntactic relationships, transforming embeddings into queries to find structural connections in the sentence.

Weight Matrices for Head 2 (4×4)

W_Q2 (Query Matrix)
Transforms words into syntactic "search queries"

0.6  0.3  0.2  0.4
0.2  0.7  0.5  0.1
0.8  0.4  0.6  0.3
0.5  0.1  0.3  0.9

W_K2 (Key Matrix)
Transforms words into syntactic "information advertisements"

0.5  0.2  0.3  0.6
0.3  0.8  0.4  0.2
0.7  0.5  0.9  0.1
0.4  0.6  0.2  0.8

W_V2 (Value Matrix)
Transforms words into syntactic "information content"

0.4  0.6  0.5  0.3
0.7  0.3  0.2  0.5
0.2  0.8  0.4  0.6
0.5  0.4  0.7  0.2

Apple's Q₂ Vector Calculation

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q2:

Q₂[0] = 0.6×0.6 + 0.4×0.2 + 1.0×0.8 + 0.2×0.5 = 0.36 + 0.08 + 0.8 + 0.1 = 1.34
Q₂[1] = 0.6×0.3 + 0.4×0.7 + 1.0×0.4 + 0.2×0.1 = 0.18 + 0.28 + 0.4 + 0.02 = 0.88
Q₂[2] = 0.6×0.2 + 0.4×0.5 + 1.0×0.6 + 0.2×0.3 = 0.12 + 0.2 + 0.6 + 0.06 = 0.98
Q₂[3] = 0.6×0.4 + 0.4×0.1 + 1.0×0.3 + 0.2×0.9 = 0.24 + 0.04 + 0.3 + 0.18 = 0.76

Apple's Query (Head 2): [1.34, 0.88, 0.98, 0.76]

✅ All Q₂ Vectors Complete

Word    Q₂ Vector (4D)              Meaning
I       [0.93, 0.51, 0.67, 0.58]    "What is my syntactic role?"
bought  [1.01, 0.96, 0.83, 0.84]    "What action do I govern?"
apple   [1.34, 0.88, 0.98, 0.76]    "What is my syntactic position?"
to      [0.53, 0.62, 0.51, 0.65]    "What follows me syntactically?"
eat     [0.95, 0.89, 0.71, 0.79]    "What action am I linked to?"

🔑 STEP 9.2: Head 2 - Positional/Syntactic (Key Vectors)

Key vectors in Head 2 advertise what syntactic information each word can provide.

Your Turn: Calculate Apple's K₂ Vector!

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K2

K₂[0] = 0.6×0.5 + 0.4×0.3 + 1.0×0.7 + 0.2×0.4 = ?

✅ All K₂ Vectors Complete

Word    K₂ Vector (4D)              Advertisement
I       [0.88, 0.62, 0.79, 0.55]    "I provide subject context!"
bought  [1.25, 0.94, 0.88, 0.72]    "I provide verb context!"
apple   [1.20, 0.79, 0.95, 0.68]    "I provide object context!"
to      [0.59, 0.65, 0.52, 0.62]    "I provide preposition context!"
eat     [0.92, 0.81, 0.74, 0.66]    "I provide verb context!"

💎 STEP 9.3: Head 2 - Positional/Syntactic (Value Vectors)

Value vectors in Head 2 carry the syntactic information each word contributes.

✅ All V₂ Vectors Complete

Word    V₂ Vector (4D)              Information Content
I       [0.65, 0.51, 0.42, 0.38]    Subject information
bought  [0.72, 0.49, 0.55, 0.41]    Verb information
apple   [0.68, 0.47, 0.53, 0.39]    Object information
to      [0.51, 0.42, 0.36, 0.45]    Preposition information
eat     [0.61, 0.46, 0.49, 0.37]    Verb information

🎯 STEP 9.4: Head 2 - Calculate Attention Scores

We compute dot products between each word's query vector and all key vectors to find syntactic relationships. The table below shows the attention scores for all query-key pairs in Head 2; apple's row matches the worked calculations that follow.

Attention Scores Matrix (Head 2)

Query \ Key    I      bought  apple   to      eat
I              1.68   1.64    1.61    1.23    1.59
bought         1.92   2.05    1.97    1.49    1.90
apple          2.92   3.91    3.75    2.34    3.17
to             1.28   1.39    1.33    1.41    1.36
eat            1.83   1.97    1.89    1.46    1.92

Apple's Attention Scores (Head 2)

Apple Query: [1.34, 0.88, 0.98, 0.76]

Apple → I:
[1.34, 0.88, 0.98, 0.76] · [0.88, 0.62, 0.79, 0.55] = 1.34×0.88 + 0.88×0.62 + 0.98×0.79 + 0.76×0.55
= 1.179 + 0.546 + 0.774 + 0.418 = 2.917

Apple → bought:
= 1.34×1.25 + 0.88×0.94 + 0.98×0.88 + 0.76×0.72 = 1.675 + 0.827 + 0.862 + 0.547 = 3.911 ⭐ HIGHEST!

Apple → apple:
= 1.34×1.20 + 0.88×0.79 + 0.98×0.95 + 0.76×0.68 = 1.608 + 0.695 + 0.931 + 0.517 = 3.751

Apple → to:
= 1.34×0.59 + 0.88×0.65 + 0.98×0.52 + 0.76×0.62 = 0.791 + 0.572 + 0.510 + 0.471 = 2.344

Apple → eat:
= 1.34×0.92 + 0.88×0.81 + 0.98×0.74 + 0.76×0.66 = 1.233 + 0.712 + 0.725 + 0.502 = 3.172

๐Ÿ“ STEP 9.5: Head 2 - Scale and Softmax

We scale attention scores by โˆš4 = 2 and apply softmax to get attention weights that sum to 1.0 for each query word.

Attention Weights Matrix (Head 2)

Query \ Key I bought apple to eat
I 0.269 0.258 0.250 0.172 0.246
bought 0.223 0.315 0.234 0.146 0.217
apple 0.269 0.315 0.221 0.147 0.048
to 0.194 0.216 0.204 0.221 0.210
eat 0.221 0.256 0.236 0.153 0.245

Practice Softmax: Head 2, Apple → bought

Raw score: 3.911, Scaled: 3.911 ÷ 2 = 1.956

Exponentials: [e^1.459, e^1.956, e^1.876, e^1.172, e^1.586] = [4.304, 7.071, 6.529, 3.229, 4.885]

Sum = 26.018, Softmax = 7.071 ÷ 26.018 = ?

Apple's Attention Weights (Head 2)

Word    Weight
I       0.165
bought  0.272 ⭐
apple   0.247
to      0.116
eat     0.200

๐Ÿ“ STEP 9.6: Head 2 - Output Vector

We compute the weighted sum of value vectors using attention weights to get Head 2's output for "apple".

Apple's Output (Head 2)

Weighted Value Vectors:
0.165 ร— [0.65, 0.51, 0.42, 0.38] = [0.107, 0.084, 0.069, 0.063]
0.272 ร— [0.72, 0.49, 0.55, 0.41] = [0.196, 0.133, 0.150, 0.112]
0.247 ร— [0.68, 0.47, 0.53, 0.39] = [0.168, 0.116, 0.131, 0.096]
0.116 ร— [0.51, 0.42, 0.36, 0.45] = [0.059, 0.049, 0.042, 0.052]
0.200 ร— [0.61, 0.46, 0.49, 0.37] = [0.122, 0.092, 0.098, 0.074]

Sum:
Component 1: 0.107 + 0.196 + 0.168 + 0.059 + 0.122 = 0.61
Component 2: 0.084 + 0.133 + 0.116 + 0.049 + 0.092 = 0.42
Component 3: 0.069 + 0.150 + 0.131 + 0.042 + 0.098 = 0.47
Component 4: 0.063 + 0.112 + 0.096 + 0.052 + 0.074 = 0.32

Output: [0.61, 0.42, 0.47, 0.32]
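As with Head 1, the whole step is one matrix product in numpy (weights and value vectors from the Head 2 tables above):

```python
import numpy as np

weights2 = np.array([0.165, 0.272, 0.247, 0.116, 0.200])  # I, bought, apple, to, eat
V2 = np.array([
    [0.65, 0.51, 0.42, 0.38],  # I
    [0.72, 0.49, 0.55, 0.41],  # bought
    [0.68, 0.47, 0.53, 0.39],  # apple
    [0.51, 0.42, 0.36, 0.45],  # to
    [0.61, 0.46, 0.49, 0.37],  # eat
])
print((weights2 @ V2).round(2))  # [0.65 0.47 0.49 0.4 ]
```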

๐Ÿ” STEP 9.7: Head 3 - Contextual Purpose (Query Vectors)

Head 3 focuses on contextual purpose, transforming embeddings into queries to find purpose-driven connections in the sentence.

Weight Matrices for Head 3 (4×4)

W_Q3 (Query Matrix)
Transforms words into purpose-driven "search queries"

0.7  0.4  0.3  0.5
0.3  0.8  0.6  0.2
0.9  0.5  0.7  0.4
0.6  0.2  0.4  0.8

W_K3 (Key Matrix)
Transforms words into purpose-driven "information advertisements"

0.6  0.3  0.4  0.5
0.4  0.7  0.5  0.3
0.8  0.6  0.8  0.2
0.5  0.7  0.3  0.9

W_V3 (Value Matrix)
Transforms words into purpose-driven "information content"

0.5  0.7  0.4  0.3
0.8  0.4  0.5  0.6
0.3  0.9  0.5  0.7
0.6  0.5  0.6  0.4

Apple's Q₃ Vector Calculation

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_Q3:

Q₃[0] = 0.6×0.7 + 0.4×0.3 + 1.0×0.9 + 0.2×0.6 = 0.42 + 0.12 + 0.9 + 0.12 = 1.57
Q₃[1] = 0.6×0.4 + 0.4×0.8 + 1.0×0.5 + 0.2×0.2 = 0.24 + 0.32 + 0.5 + 0.04 = 1.10
Q₃[2] = 0.6×0.3 + 0.4×0.6 + 1.0×0.7 + 0.2×0.4 = 0.18 + 0.24 + 0.7 + 0.08 = 1.20
Q₃[3] = 0.6×0.5 + 0.4×0.2 + 1.0×0.4 + 0.2×0.8 = 0.3 + 0.08 + 0.4 + 0.16 = 0.94

Apple's Query (Head 3): [1.57, 1.10, 1.20, 0.94]

✅ All Q₃ Vectors Complete

Word    Q₃ Vector (4D)              Meaning
I       [1.05, 0.63, 0.79, 0.67]    "What is my purpose?"
bought  [1.13, 0.99, 0.87, 0.89]    "What is the purpose of my action?"
apple   [1.57, 1.10, 1.20, 0.94]    "What is my contextual purpose?"
to      [0.61, 0.67, 0.56, 0.69]    "What purpose do I serve?"
eat     [1.01, 0.94, 0.76, 0.84]    "What is my action's purpose?"

🔑 STEP 9.8: Head 3 - Contextual Purpose (Key Vectors)

Key vectors in Head 3 advertise what purpose-driven information each word can provide.

Your Turn: Calculate Apple's K₃ Vector!

Apple embedding [0.6, 0.4, 1.0, 0.2] × W_K3

K₃[0] = 0.6×0.6 + 0.4×0.4 + 1.0×0.8 + 0.2×0.5 = ?

✅ All K₃ Vectors Complete

Word    K₃ Vector (4D)              Advertisement
I       [0.99, 0.71, 0.84, 0.62]    "I provide agent purpose!"
bought  [1.30, 0.98, 0.92, 0.77]    "I provide action purpose!"
apple   [1.42, 0.84, 0.99, 0.73]    "I provide object purpose!"
to      [0.64, 0.69, 0.57, 0.67]    "I provide preposition purpose!"
eat     [0.97, 0.86, 0.79, 0.71]    "I provide action purpose!"

💎 STEP 9.9: Head 3 - Contextual Purpose (Value Vectors)

Value vectors in Head 3 carry the purpose-driven information each word contributes.

✅ All V₃ Vectors Complete

Word    V₃ Vector (4D)              Information Content
I       [0.67, 0.53, 0.44, 0.39]    Agent purpose
bought  [0.74, 0.51, 0.57, 0.43]    Action purpose
apple   [0.70, 0.49, 0.55, 0.41]    Object purpose
to      [0.53, 0.44, 0.38, 0.47]    Preposition purpose
eat     [0.63, 0.48, 0.51, 0.39]    Action purpose

🎯 STEP 9.10: Head 3 - Calculate Attention Scores

We compute dot products between each word's query vector and all key vectors to find purpose-driven relationships; apple's row matches the worked calculations that follow.

Attention Scores Matrix (Head 3)

Query \ Key    I      bought  apple   to      eat
I              1.95   1.91    1.88    1.31    1.85
bought         2.15   2.28    2.20    1.47    2.13
apple          3.93   4.95    5.03    3.08    4.08
to             1.49   1.60    1.54    1.62    1.57
eat            2.07   2.20    2.12    1.44    2.09

Apple's Attention Scores (Head 3)

Apple Query: [1.57, 1.10, 1.20, 0.94]

Apple → I:
[1.57, 1.10, 1.20, 0.94] · [0.99, 0.71, 0.84, 0.62] = 1.57×0.99 + 1.10×0.71 + 1.20×0.84 + 0.94×0.62
= 1.554 + 0.781 + 1.008 + 0.583 = 3.926

Apple → bought:
= 1.57×1.30 + 1.10×0.98 + 1.20×0.92 + 0.94×0.77 = 2.041 + 1.078 + 1.104 + 0.724 = 4.947

Apple → apple:
= 1.57×1.42 + 1.10×0.84 + 1.20×0.99 + 0.94×0.73 = 2.229 + 0.924 + 1.188 + 0.686 = 5.027 ⭐ HIGHEST!

Apple → to:
= 1.57×0.64 + 1.10×0.69 + 1.20×0.57 + 0.94×0.67 = 1.005 + 0.759 + 0.684 + 0.630 = 3.078

Apple → eat:
= 1.57×0.97 + 1.10×0.86 + 1.20×0.79 + 0.94×0.71 = 1.523 + 0.946 + 0.948 + 0.667 = 4.084

๐Ÿ“ STEP 9.11: Head 3 - Scale and Softmax

We scale attention scores by โˆš4 = 2 and apply softmax to get attention weights that sum to 1.0 for each query word.

Attention Weights Matrix (Head 3)

Query \ Key I bought apple to eat
I 0.241 0.232 0.225 0.127 0.218
bought 0.228 0.260 0.239 0.115 0.223
apple 0.141 0.351 0.151 0.104 0.253
to 0.187 0.209 0.197 0.213 0.203
eat 0.225 0.257 0.237 0.119 0.229

Practice Softmax: Head 3, Apple → eat

Raw score: 4.084, Scaled: 4.084 ÷ 2 = 2.042

Exponentials: [e^1.963, e^2.474, e^2.514, e^1.539, e^2.042] = [7.125, 11.870, 12.353, 4.657, 7.706]

Sum = 43.711, Softmax = 7.706 ÷ 43.711 = ?

Apple's Attention Weights (Head 3)

Word    Weight
I       0.163
bought  0.204
apple   0.177
to      0.121
eat     0.335 ⭐

๐Ÿ“ STEP 9.12: Head 3 - Output Vector

We compute the weighted sum of value vectors using attention weights to get Head 3's output for "apple".

Apple's Output (Head 3)

Weighted Value Vectors:
0.163 ร— [0.67, 0.53, 0.44, 0.39] = [0.109, 0.086, 0.072, 0.064]
0.204 ร— [0.74, 0.51, 0.57, 0.43] = [0.151, 0.104, 0.116, 0.088]
0.177 ร— [0.70, 0.49, 0.55, 0.41] = [0.124, 0.087, 0.097, 0.073]
0.121 ร— [0.53, 0.44, 0.38, 0.47] = [0.064, 0.053, 0.046, 0.057]

0.335 ร— [0.63, 0.48, 0.51, 0.39] = [0.211, 0.161, 0.171, 0.131]

Sum:
Component 1: 0.109 + 0.151 + 0.124 + 0.064 + 0.211 = 0.66
Component 2: 0.086 + 0.104 + 0.087 + 0.053 + 0.161 = 0.49
Component 3: 0.072 + 0.116 + 0.097 + 0.046 + 0.171 = 0.50
Component 4: 0.064 + 0.088 + 0.073 + 0.057 + 0.131 = 0.41

Output: [0.66, 0.49, 0.50, 0.41]

🔗 STEP 10: Concatenate Head Outputs

We combine the outputs from all three heads for "apple" to form a 12D vector, capturing semantic, syntactic, and purposive insights.

Apple's Multi-Head Output

Head 1 (Semantic):  [0.59, 0.69, 0.48, 0.56]
Head 2 (Syntactic): [0.65, 0.47, 0.49, 0.40]
Head 3 (Purpose):   [0.66, 0.49, 0.50, 0.41]

Concatenated: [0.59, 0.69, 0.48, 0.56, 0.65, 0.47, 0.49, 0.40, 0.66, 0.49, 0.50, 0.41]
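Concatenation is a purely mechanical step; a short numpy sketch using the three head outputs above:

```python
import numpy as np

head1 = np.array([0.59, 0.69, 0.48, 0.56])  # semantic
head2 = np.array([0.65, 0.47, 0.49, 0.40])  # syntactic
head3 = np.array([0.66, 0.49, 0.50, 0.41])  # purpose

multi_head = np.concatenate([head1, head2, head3])
print(multi_head.shape)  # (12,): three 4-D head outputs side by side
```

In a full transformer, this concatenated vector would additionally be multiplied by an output projection matrix (often written W_O) to map it back to the model dimension; that step is outside the scope of this walkthrough.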

🎉 STEP 11: Understanding the Result

The 12D vector represents "apple" with enriched context: Head 1 contributes semantic (consumption-related) information, Head 2 contributes syntactic (object-of-verb) information, and Head 3 contributes purposive information. This multi-faceted representation enables the transformer to understand "apple" in context!

โ“ Quiz Time!

Test Your Knowledge

Q1: Why does Head 1 give high attention to "eat" for "apple"?

a) "Eat" is the subject
b) "Eat" provides semantic context as consumption
c) "Eat" is syntactically closest

Q2: What does the softmax step ensure?

a) Scores are scaled
b) Weights sum to 1.0
c) Vectors are 4D