
Transformer Architectures

From the 2017 paper to today's LLMs. Self-attention and the QKV trio, the GPT decoder-only branch, RLHF and alignment, and the modern transformer anatomy you'd actually fine-tune.

5 lessons · ~21 min read

01. Attention Is All You Need (6 min). The 2017 paper that replaced recurrence with attention and made modern LLMs possible: self-attention, Q/K/V, multi-head attention, positional encoding, and the full encoder–decoder architecture.
02. What Came After the Transformer (3 min). The Transformer didn't lead to one next thing: it branched, then scaled, then got aligned. A map of the lineage from 2017 to today's LLMs, tying the key papers into a single timeline.
03. GPT, Decoder-Only Models & In-Context Learning (4 min). Why "predict the next token", scaled up, became the recipe for general-purpose AI. Causal attention, autoregressive generation, and the in-context learning that emerged with GPT-3.
04. RLHF & Alignment (4 min). How a raw next-token predictor becomes a helpful assistant: the three-stage RLHF pipeline behind InstructGPT and ChatGPT, and the simpler DPO alternative that followed.
05. The Modern Transformer Anatomy (4 min). The models you actually fine-tune (Llama, Qwen) are decoder-only Transformers with five upgrades over the 2017 original: RoPE, RMSNorm, SwiGLU, GQA, and (sometimes) Mixture-of-Experts. Same core idea, much better engineering.