GenAI-2026 · S03

GPT Evolution & Alignment

2017 → 2023
12 Research Papers That Built Modern AI
Foundation → Scale → Alignment → What You Use Today
2017: Attention Is All You Need
2018: GPT-1 · BERT
2019: GPT-2 · BART
2020: GPT-3 · 175B
2022: InstructGPT · HH-RLHF · Constitutional AI
2023: RLAIF · DPO · SELF-REFINE
Legend: Foundation · GPT Series · Encoder Models · Alignment Research
→ ChatGPT & Claude are the direct products of this 6-year research journey
1
Foundation
5
GPT + Encoder
6
Alignment
6
Years
12
Papers Total
Foundation

Attention Is All You Need

Vaswani et al. · Google · 2017
ENCODER (× 6 stacked layers)
Multi-Head Self-Attention: reads all tokens simultaneously
Add & Norm: residual connection + layer normalisation
Feed-Forward Network: position-wise transformation
Input (English): The cat sat
Cross-Attention: Decoder attends to Encoder
DECODER (× 6 stacked layers)
Masked Self-Attention: can only see past tokens (causal)
Cross-Attention + Norm: attends to encoder output
Feed-Forward + Softmax: predict next output token
Output (Italian): Il gatto sedeva
512
d_model
8
Attention heads
6+6
Encoder + Decoder layers
No RNN
Fully parallel
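The self-attention and masked self-attention boxes above can be sketched in a few lines. This is a minimal single-head sketch in plain Python, assuming the queries, keys and values are already projected; the function names and the toy input are illustrative and not taken from the paper's code.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Decoder-style mask: position i may not look at positions > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# Toy 3-token sequence with 2-dimensional embeddings (illustrative values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_style = attention(X, X, X)               # every token sees every token
decoder_style = attention(X, X, X, causal=True)  # each token sees only the past
```

With `causal=True` the first token can attend only to itself, which is why its output equals its own value vector.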
GPT Series

GPT-1 — Generative Pre-Training

Radford et al. · OpenAI · 2018
PHASE 1: PRE-TRAINING
Unlabeled Text Corpus (BooksCorpus): 800M words, predict the next token
12-Layer Transformer: Decoder-only · 117M parameters
Transfer: copy weights
PHASE 2: FINE-TUNING
Small labeled datasets (task-specific): Sentence → Label · Q & A → Answer · Text → Entailment · Span → Similarity
Task-Specific Classifier: small output layer added on top
SOTA on 9 of 12 NLP tasks
117M
Parameters
800M
Training words
9/12
New SOTA tasks
Decoder-only
Architecture
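The "predict the next token" objective in Phase 1 can be illustrated with a toy example. This is a sketch only: `prob_fn` stands in for a trained model, and the uniform model below is a hypothetical placeholder rather than anything from the GPT-1 paper.

```python
import math

def next_token_pairs(tokens):
    """Turn a token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def average_nll(pairs, prob_fn):
    """Mean negative log-likelihood of each target given its context."""
    return -sum(math.log(prob_fn(ctx, tgt)) for ctx, tgt in pairs) / len(pairs)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(tokens)

# Hypothetical stand-in for a model: uniform over the toy vocabulary.
vocab = sorted(set(tokens))

def uniform_model(context, target):
    return 1.0 / len(vocab)

loss = average_nll(pairs, uniform_model)
```

A model that has learned anything about the text should score a lower loss than this uniform baseline, which is what pre-training optimises for.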
Encoder Model

BERT — Bidirectional Transformers

Devlin et al. · Google · 2018
GPT: reads → LEFT TO RIGHT
BERT: reads ← BOTH DIRECTIONS →
Input: The cat [MASK] on the mat [SEP] (masked word was "sat")
Context is read from both the LEFT and the RIGHT of [MASK]
Predicts: "sat"
Masked Language Modelling (MLM)
340M
Parameters (Large)
MLM
Masked language model
NSP
Next sentence prediction
NLU
Best for classification
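The masking step behind MLM can be sketched as follows. The 15% selection rate and the 80/10/10 split (mask, random token, unchanged) follow the BERT paper; the function name and toy vocabulary are assumptions, and this simplified version does not skip special tokens such as [SEP].

```python
import random

def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels); labels are None where no prediction is needed."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                # model must recover the original
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")       # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)            # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)               # position ignored by the loss
    return inputs, labels

sentence = ["The", "cat", "sat", "on", "the", "mat"]
toy_vocab = ["The", "cat", "sat", "on", "the", "mat", "dog", "ran"]
inputs, labels = mask_for_mlm(sentence, toy_vocab, rng=random.Random(42))
```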
GPT Series

GPT-2 — Language Models are Multitask Learners

Radford et al. · OpenAI · 2019
🎯 Zero-Shot Transfer: no fine-tuning, tasks solved from prompting only
📚 WebText: about 8 million documents, 40 GB of text from quality-filtered Reddit outbound links
GPT-1 · 117M · pre-train + fine-tune
GPT-2 Small · 117M · same size, larger dataset
GPT-2 Medium · 345M · 3× ↑
GPT-2 XL · 1.5B parameters · 13× ↑ · zero-shot transfer
1.5B
Max parameters
8M
Training documents
Zero-Shot
No task-specific training
~13×
Bigger than GPT-1 (1.5B vs 117M)
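The size multipliers on this slide can be checked with simple arithmetic. The sketch below uses only the parameter counts quoted above; the dictionary layout is just an illustrative choice.

```python
# Parameter counts as quoted on the slide.
params = {
    "GPT-1": 117e6,
    "GPT-2 Small": 117e6,
    "GPT-2 Medium": 345e6,
    "GPT-2 XL": 1.5e9,
}

baseline = params["GPT-1"]
ratios = {name: count / baseline for name, count in params.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the size of GPT-1")
```

Rounded to whole numbers, the Medium and XL ratios match the 3× and 13× labels in the diagram.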
Encoder Model

BART — Denoising Seq2Seq

Lewis et al. · Facebook AI · 2019
① Original Text (clean training data)
The cat sat on the mat
She went to the store
Rain falls on the plain
Birds fly south in winter
② Corrupted Input (corruption strategies: masking · deletion · rotation)
The [MASK] sat on mat
← sentence deleted →
plain the on falls Rain
Birds [MASK] south winter
③ BERT-style Encoder + GPT-style Decoder
Encoder: Bidirectional Self-Attention → Context Vector
Decoder: Autoregressive · Cross-Attention · Generates text
Output: Reconstructed Original Text
Best for Summarisation · ROUGE-1 = 44.16
BERT
Encoder (bidirectional)
GPT
Decoder (autoregressive)
44.16
ROUGE-1 on CNN/DM
Denoise
Learn by reconstruction
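The three corruption strategies named on this slide can be sketched as small functions. This is an illustrative sketch, not BART's actual preprocessing code: the function names and probabilities are assumptions, and the paper also describes text infilling and sentence permutation, which are omitted here.

```python
import random

def mask_tokens(tokens, prob, rng):
    """Token masking: replace some tokens with [MASK]."""
    return ["[MASK]" if rng.random() < prob else t for t in tokens]

def delete_tokens(tokens, prob, rng):
    """Token deletion: drop some tokens; the model must infer where."""
    return [t for t in tokens if rng.random() >= prob]

def rotate_document(tokens, rng):
    """Document rotation: restart the sequence at a random token."""
    if not tokens:
        return tokens
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]

rng = random.Random(0)
original = "Rain falls on the plain".split()
corrupted = {
    "masked": mask_tokens(original, 0.3, rng),
    "deleted": delete_tokens(original, 0.3, rng),
    "rotated": rotate_document(original, rng),
}
# Training pair: the encoder reads a corrupted version and the decoder
# is trained to regenerate `original`.
```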
GPT Series

GPT-3 — Few-Shot Learners at Scale

Brown et al. · OpenAI · 2020
Parameter Scale: GPT-1 0.1B · GPT-2 1.5B · GPT-3 175B parameters
Few-Shot In-Context Learning: no gradient update, examples live in the prompt
Example 1: Q: capital of France? → A: Paris
Example 2: Q: capital of Germany? → A: Berlin
Example 3: Q: capital of Japan? → A: Tokyo
Query: Q: capital of Italy? → A: ? → Rome ✓
Zero-Shot: 0 examples · One-Shot: 1 example · Few-Shot ★: k examples in prompt
175B
Parameters
300B
Training tokens
96
Attention heads
No
Fine-tuning needed
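The zero-shot, one-shot and few-shot settings differ only in how many solved examples are placed in the prompt. A minimal sketch, assuming a simple "Q:/A:" layout; `build_prompt` is a hypothetical helper, not part of any API.

```python
def build_prompt(examples, query):
    """Concatenate k solved examples followed by the unanswered query."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

examples = [
    ("capital of France?", "Paris"),
    ("capital of Germany?", "Berlin"),
    ("capital of Japan?", "Tokyo"),
]

zero_shot = build_prompt([], "capital of Italy?")
one_shot = build_prompt(examples[:1], "capital of Italy?")
few_shot = build_prompt(examples, "capital of Italy?")
```

The model's weights are identical in all three cases; only the text it conditions on changes.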
Alignment

InstructGPT — Learning to Follow Instructions

Ouyang et al. · OpenAI · 2022
① Supervised Fine-Tune (SFT)
Start: GPT-3 (175B)
👤 Labeller → SFT Model
Human writes ideal responses to prompts · ~13K prompt-response pairs
② Train Reward Model
Outputs A, B, C → rank → Reward Model → scalar score
Human ranks model outputs for ~33K prompts · A > B > C preference pairs
③ PPO Reinforcement
SFT Model generates response to new prompt → Reward Model scores it → Policy gradient update
Repeat until model reliably follows human intent
→ Led to ChatGPT
3
Training phases
85%
Preferred over raw GPT-3
1.3B
Beats 175B unaligned
ChatGPT
Direct descendant
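Step ② turns a human ranking into preference pairs and trains the reward model so that preferred outputs score higher. The sketch below shows the pairwise log-sigmoid loss on toy numbers; the reward values are invented for illustration and the function names are assumptions.

```python
import math
from itertools import combinations

def pairs_from_ranking(ranked):
    """ranked is best-first; each earlier item is preferred to each later one."""
    return list(combinations(ranked, 2))

def pairwise_loss(reward, pairs):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all pairs."""
    def neg_log_sigmoid(x):
        return math.log1p(math.exp(-x))
    return sum(neg_log_sigmoid(reward[w] - reward[l]) for w, l in pairs) / len(pairs)

ranking = ["A", "B", "C"]                 # labeller says A > B > C
pairs = pairs_from_ranking(ranking)

# Hypothetical reward-model scores for the three outputs.
agrees_with_human = {"A": 2.0, "B": 0.5, "C": -1.0}
disagrees_with_human = {"A": -1.0, "B": 0.5, "C": 2.0}

good_loss = pairwise_loss(agrees_with_human, pairs)
bad_loss = pairwise_loss(disagrees_with_human, pairs)
```

A reward model whose scores agree with the human ranking gets the lower loss, which is the signal PPO later optimises against.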
Alignment

HH-RLHF — Helpful and Harmless Assistant

Bai et al. · Anthropic · 2022
Being helpful ≠ being safe: Anthropic studied this tradeoff directly
HELPFUL: Answers user's question · Provides useful information
↔ tension ↔
HARMLESS: Refuses harmful requests · Avoids dangerous content
170,000 human preference pairs
→ Became Claude
170K
Preference pairs
Helpful vs Harmless
Competing objectives
Claude
Anthropic's assistant
Alignment

Constitutional AI — Harmlessness via AI Feedback

Bai et al. · Anthropic · 2022
The Constitution: 16 Written Rules
✦ Do not assist with illegal acts
✦ Avoid deceptive content
✦ Respect human dignity
✦ No harmful information
✦ Protect personal privacy
✦ Avoid manipulation
+ 10 more principles...
No human labellers needed for harmlessness
The Constitution guides the loop:
① GENERATE: AI writes initial response
② CRITIQUE: Check against 16 rules
③ REVISE: AI rewrites to comply with constitution
RLAIF: AI labels replace human feedback
16
Written principles
0
Human labellers for safety
RLAIF
AI feedback replaces humans
Loop
Generate → Critique → Revise
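The Generate → Critique → Revise loop can be sketched with a model passed in as a plain callable. This is a simplified sketch: the prompt wording is invented, `model` is a hypothetical text-in, text-out function, and the loop walks through every principle in order, whereas the paper samples principles at random. In the paper, the revised responses then become supervised fine-tuning data.

```python
def critique_and_revise(model, prompt, principles):
    """Draft a response, then critique and rewrite it once per principle."""
    response = model(f"Respond to the request: {prompt}")
    for principle in principles:
        critique = model(
            f"Critique this response against the principle '{principle}'.\n"
            f"Response: {response}"
        )
        response = model(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response

# Hypothetical stub that records each call instead of running a real model.
calls = []

def recording_stub(text):
    calls.append(text)
    return f"draft {len(calls)}"

principles = ["Avoid deceptive content", "Protect personal privacy"]
revised = critique_and_revise(recording_stub, "example request", principles)
```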
Alignment

RLAIF — Scaling RLHF with AI Feedback

Lee et al. · Google · 2023
Can AI replace humans as the labeller in RLHF?
Human RLHF
👤 Human Annotator → Prefers A over B → Preference Label
Slow · Expensive · Inconsistent
Hard to scale beyond 100K labels · $$$ per label
VS
RLAIF (AI Labels)
🤖 AI Model (e.g. PaLM 2) → Prefers A over B → AI Preference Label
Fast · Cheap · Consistent
Scales to millions of labels · Near-zero cost per label
RLAIF ≈ RLHF: 71% vs 73% win rate over the SFT baseline (summarisation)
71% vs 73%
Win rate over SFT baseline (RLAIF vs RLHF)
Scalable labelling
$0
Marginal cost per label
Validates
Constitutional AI approach
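One way to turn an AI labeller's output into a preference label is to compare the log-probabilities it assigns to "first is better" and "second is better", and to query both orderings to reduce position bias. The sketch below is illustrative: `score_fn` is a hypothetical callable returning those two log-probabilities, and both toy labellers are invented.

```python
import math

def soft_preference(logp_first, logp_second):
    """Softmax over two choice log-probabilities; returns P(first is better)."""
    m = max(logp_first, logp_second)
    a = math.exp(logp_first - m)
    b = math.exp(logp_second - m)
    return a / (a + b)

def debiased_preference(score_fn, resp_a, resp_b):
    """Query the labeller in both orders and average; returns P(A is better)."""
    p_a_shown_first = soft_preference(*score_fn(resp_a, resp_b))
    p_b_shown_first = soft_preference(*score_fn(resp_b, resp_a))
    return 0.5 * (p_a_shown_first + (1.0 - p_b_shown_first))

# Hypothetical labeller that always favours whichever response is shown first.
def position_biased(first, second):
    return math.log(0.8), math.log(0.2)

# Hypothetical labeller that favours the longer response regardless of order.
def prefers_longer(first, second):
    return float(len(first)), float(len(second))

p_biased = debiased_preference(position_biased, "short", "a longer summary")
p_longer = debiased_preference(prefers_longer, "short", "a longer summary")
```

Averaging over both orders cancels a pure position preference, while a labeller with a genuine content preference keeps its signal.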
Alignment

DPO — Direct Preference Optimisation

Rafailov et al. · Stanford · 2023
RLHF needs 3 training phases; DPO replaces reward modelling and PPO with 1 preference-loss phase
RLHF (3 phases)
① SFT: Fine-tune base LM, supervised on demonstrations
② Train Reward Model: separate model learns human prefs
③ PPO Optimisation: tricky RL loop, KL divergence penalty
→ Aligned Model · Complex · Unstable · Slow
DPO (1 phase)
Preference Data (y_w > y_l): preferred vs rejected response pairs
Closed-form solution → Direct Policy Update
No reward model needed · No PPO needed
→ Aligned Model ✓ · Stable · Simple · Comparable results
1
Training phase (vs 3 in RLHF)
No RM
No reward model
No PPO
No RL required
LLaMA/Mistral
Many open-source models use DPO
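The DPO loss for a single preference pair can be written directly from sequence log-probabilities. A minimal sketch: the inputs are assumed to be log p(y | x) under the policy and a frozen reference model, and `beta=0.1` is an illustrative value, not a recommendation.

```python
import math

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (gain_w - gain_l))).

    policy_w / policy_l: policy log-prob of the preferred / rejected response.
    ref_w / ref_l: the same quantities under the frozen reference model.
    """
    gain_w = policy_w - ref_w     # how much the policy upweights y_w
    gain_l = policy_l - ref_l     # how much the policy upweights y_l
    margin = beta * (gain_w - gain_l)
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

# Toy log-probabilities (illustrative numbers only).
unchanged = dpo_loss(-5.0, -5.0, -5.0, -5.0)    # policy equals reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)     # prefers y_w more than reference
worsened = dpo_loss(-6.0, -4.0, -5.0, -5.0)     # prefers y_l more than reference
```

Minimising this loss by gradient descent pushes the policy to raise the preferred response relative to the reference, without a separate reward model or RL loop.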
Alignment

SELF-REFINE — Iterative Refinement

Madaan et al. · CMU / AI2 · 2023
Same model plays all 3 roles: no extra training required
① GENERATE: Model writes initial output → passes output
② CRITIQUE: Same model finds flaws → passes feedback
③ REFINE: Model rewrites better output
Output Quality: Pass 1 → Pass 2 → Pass 3 → Pass 4 ✓
+20% average improvement · No extra training needed · Works at inference time only
3 Roles
One model plays all
+20%
Average improvement
0
Additional training
Agents
Pattern behind AI agents
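The generate, critique, refine cycle can be sketched as a loop with a stopping condition. In SELF-REFINE the same language model plays all three roles through different prompts; the rule-based toy functions below are hypothetical stand-ins so that the sketch runs without a model.

```python
def self_refine(generate, critique, refine, task, max_passes=4):
    """Iterate until the critic has no feedback or the pass budget runs out."""
    output = generate(task)
    history = [output]
    for _ in range(max_passes - 1):
        feedback = critique(task, output)
        if feedback is None:          # critic has no remaining complaints
            break
        output = refine(task, output, feedback)
        history.append(output)
    return output, history

# Toy stand-ins for the three roles (a real system would prompt one LLM).
def toy_generate(task):
    return "the cat sat"

def toy_critique(task, output):
    if not output[0].isupper():
        return "capitalise"
    if not output.endswith("."):
        return "add full stop"
    return None

def toy_refine(task, output, feedback):
    if feedback == "capitalise":
        return output[0].upper() + output[1:]
    return output + "."

final, history = self_refine(toy_generate, toy_critique, toy_refine,
                             "write a sentence")
```

Nothing is trained here: all improvement comes from repeated inference passes, which is the point of the slide.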
Summary

The Full Story in One Diagram

2017 → Today
Transformer (2017): Self-attention · No RNN
Encoder branch: BERT (2018) Bidirectional · NLU → BART (2019) Encoder+Decoder
Decoder branch: GPT-1 (2018) 117M · pre-train+fine-tune → 10× scale → GPT-2 (2019) 1.5B · zero-shot → 100× scale → GPT-3 (2020) 175B · few-shot mastery
RLHF ALIGNMENT TRACK
InstructGPT (2022): SFT → RM → PPO
HH-RLHF (2022): Helpful + Harmless
Constitutional AI (2022): AI self-critique against 16 written rules
RLAIF (2023): AI labels ≈ human labels
DPO (2023): No reward model needed
SELF-REFINE (2023): Generate → Critique → Refine · no extra training
→ ChatGPT → Claude → What you use today
1
Architecture
3
GPT Generations
2
Encoder Models
6
Alignment Methods
ChatGPT + Claude
What you use today