Series · Private
Transformer Architectures
From the 2017 paper to today's LLMs. Self-attention and the QKV trio, the GPT decoder-only branch, RLHF and alignment, and the modern transformer anatomy you'd actually fine-tune.
- 01. Attention Is All You Need: The 2017 paper that replaced recurrence with attention and made modern LLMs possible, covering self-attention, Q/K/V, multi-head attention, positional encoding, and the full encoder-decoder architecture. (6 min)
- 02. What Came After the Transformer: The Transformer didn't lead to one next thing. It branched, then scaled, then got aligned. A map of the lineage from 2017 to today's LLMs, tying the key papers into a single timeline. (3 min)
- 03. GPT, Decoder-Only Models & In-Context Learning: Why "predict the next token", scaled up, became the recipe for general-purpose AI. Causal attention, autoregressive generation, and the in-context learning that emerged with GPT-3. (4 min)
- 04. RLHF & Alignment: How a raw next-token predictor becomes a helpful assistant. The three-stage RLHF pipeline behind InstructGPT and ChatGPT, and the simpler DPO alternative that followed. (4 min)
- 05. The Modern Transformer Anatomy: The models you actually fine-tune (Llama, Qwen) are decoder-only Transformers with five upgrades over the 2017 original: RoPE, RMSNorm, SwiGLU, GQA, and (sometimes) Mixture-of-Experts. Same core idea, much better engineering. (4 min)