GenAI-2026 · S03
GPT Evolution & Alignment
2017 → 2023
Foundation
Attention Is All You Need
Vaswani et al. · Google · 2017
6+6
Encoder + Decoder layers
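Sketch: the operation the paper is named after is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The version below is pure Python on nested lists, written for readability rather than speed. The function names are illustrative and do not come from the paper.

```python
import math


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of query, key and value vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

A query that scores every key equally simply returns the mean of the value vectors.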
GPT Series
GPT-1 — Generative Pre-Training
Radford et al. · OpenAI · 2018
Encoder Model
BERT — Bidirectional Transformers
Devlin et al. · Google · 2018
NSP
Next sentence prediction
NLU
Best for classification
GPT Series
GPT-2 — Language Models are Multitask Learners
Radford et al. · OpenAI · 2019
Zero-Shot
No task-specific training
Encoder-Decoder Model
BART — Denoising Seq2Seq
Lewis et al. · Facebook AI · 2019
BERT
Encoder (bidirectional)
GPT
Decoder (autoregressive)
Denoise
Learn by reconstruction
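Sketch: one of BART's noising schemes, text infilling, replaces a whole span of tokens with a single mask token, and the model is trained to reconstruct the original sequence. In the paper the span lengths are sampled (Poisson, lambda = 3); here the start and length are passed in explicitly so the example stays deterministic. The helper names and the `<mask>` string are assumptions for illustration.

```python
MASK = "<mask>"  # placeholder mask string, assumed for this sketch


def infill(tokens, start, length):
    """Replace tokens[start:start + length] with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]


def training_pair(tokens, start, length):
    # The encoder sees the corrupted sequence; the decoder target is the original.
    return infill(tokens, start, length), list(tokens)
```

For example, masking two tokens starting at index 1 of `["the", "cat", "sat", "on", "the", "mat"]` leaves a five-token input, so the model must also infer how many tokens are missing.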
GPT Series
GPT-3 — Few-Shot Learners at Scale
Brown et al. · OpenAI · 2020
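Sketch: few-shot learning in the GPT-3 sense means placing a handful of demonstrations in the prompt, with no gradient updates. The `Input:` / `Output:` template and the function name below are assumptions for illustration, not a format prescribed by the paper.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a task description, k demonstrations, and the new query."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final slot is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)
```

Passing an empty `examples` list gives the zero-shot variant of the same prompt.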
Alignment
InstructGPT — Learning to Follow Instructions
Ouyang et al. · OpenAI · 2022
85%
Preferred over raw GPT-3
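Sketch: the reward model in an RLHF pipeline of this kind is trained on human comparisons with a pairwise loss, -log sigmoid(r_chosen - r_rejected). The function below computes that loss for one comparison from two scalar rewards. The name and signature are illustrative.

```python
import math


def pairwise_rm_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)) for one human comparison."""
    diff = reward_chosen - reward_rejected
    # Stable form of log(1 + exp(-diff)) that avoids overflow for large |diff|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss falls as the reward model scores the human-preferred response further above the rejected one.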
Alignment
HH-RLHF — Helpful and Harmless Assistant
Bai et al. · Anthropic · 2022
Helpful vs Harmless
Competing objectives
Claude
Anthropic's assistant
Alignment
Constitutional AI — Harmlessness via AI Feedback
Bai et al. · Anthropic · 2022
0
Human labels for harmlessness (helpfulness still uses human feedback)
RLAIF
AI feedback replaces human harmlessness labels
Loop
Generate → Critique → Revise
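Sketch: the Generate → Critique → Revise loop can be written as a short function over a list of constitutional principles. `model` is a hypothetical callable that maps a prompt string to a completion string, and the prompt wording is invented for illustration. In the paper, the revised responses become supervised fine-tuning data; the later RL stage is not shown here.

```python
def critique_and_revise(model, prompt, principles):
    """Return a (prompt, revised_response) pair for supervised fine-tuning."""
    # Generate: initial response to the user prompt.
    response = model(f"Human: {prompt}\nAssistant:")
    for principle in principles:
        # Critique: ask the same model to find problems under one principle.
        critique = model(
            f"Response: {response}\nCritique this response according to: {principle}"
        )
        # Revise: rewrite the response to address the critique.
        response = model(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response
```

Each principle costs two extra model calls, so a constitution with n principles needs 1 + 2n calls per prompt in this sketch.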
Alignment
RLAIF — Scaling RLHF with AI Feedback
Lee et al. · Google · 2023
71–73%
Preferred over SFT baseline (RLAIF 71%, RLHF 73%)
Over 10x
Cheaper per label than human annotation (authors' estimate)
Validates
Constitutional AI approach
Alignment
DPO — Direct Preference Optimisation
Rafailov et al. · Stanford · 2023
1
Preference-training stage after SFT (vs reward model + RL in RLHF)
LLaMA/Mistral
Many open-weight models are aligned with DPO
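Sketch: the DPO objective for one preference pair is -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]), where y_w is the chosen response and y_l the rejected one. The function below takes the four sequence log-probabilities as plain floats. The argument names and the default `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair of responses."""
    # Log-ratios of the policy against the frozen reference model.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

No separate reward model appears: the log-ratio against the reference policy plays the role of an implicit reward.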
Alignment
SELF-REFINE — Iterative Refinement
Madaan et al. · CMU / AI2 · 2023
3 Roles
One model generates, critiques, and refines
Agents
Pattern behind AI agents
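Sketch: Self-Refine alternates feedback and refinement with a single model until the feedback signals that no further changes are needed or an iteration budget runs out. `model` is a hypothetical callable from prompt string to completion string; the prompt wording, `max_iters`, and the `STOP` convention are assumptions for illustration.

```python
def self_refine(model, task, max_iters=3, stop_token="STOP"):
    """Draft, then loop feedback -> refine with the same model."""
    draft = model(f"Task: {task}\nWrite an answer.")
    for _ in range(max_iters):
        feedback = model(
            f"Task: {task}\nAnswer: {draft}\n"
            f"Give feedback, or reply {stop_token} if no changes are needed."
        )
        if stop_token in feedback:
            break  # the model judges its own draft good enough
        draft = model(
            f"Task: {task}\nAnswer: {draft}\nFeedback: {feedback}\n"
            "Rewrite the answer using the feedback."
        )
    return draft
```

No weights are updated at any point; all of the improvement happens at inference time.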
Summary
The Full Story in One Diagram
2017 → Today
ChatGPT + Claude
What you use today