Every essay
The Modern Transformer Anatomy
The models you actually fine-tune (Llama, Qwen) are decoder-only Transformers with five upgrades over the 2017 original — RoPE, RMSNorm, SwiGLU, GQA, and (sometimes) Mixture-of-Experts. Same core idea, much better engineering.
RLHF & Alignment
How a raw next-token predictor becomes a helpful assistant — the three-stage RLHF pipeline behind InstructGPT and ChatGPT, and the simpler DPO alternative that followed.
GPT, Decoder-Only Models & In-Context Learning
Why "predict the next token" — scaled up — became the recipe for general-purpose AI. Causal attention, autoregressive generation, and the in-context learning that emerged with GPT-3.
What Came After the Transformer
The Transformer didn't lead to one next thing: it branched, then scaled, then got aligned. A map of the lineage from 2017 to today's LLMs, tying the key papers into a single timeline.
Attention Is All You Need
The 2017 paper that replaced recurrence with attention and made modern LLMs possible — self-attention, Q/K/V, multi-head, positional encoding, and the full encoder–decoder architecture.
Statistics & Inference
Sampling and practical sampling strategies, the Central Limit Theorem, hypothesis testing, and the z / t / chi-square tests — when to use each, what a p-value really means, and where it all applies.
Probability Concepts
Conditional probability, independence, the Law of Total Probability, and Bayes' Theorem — with the intuitive examples that make them stick, and why they sit at the heart of machine learning.
Cross-Entropy
The loss function behind virtually every classifier and every LLM pre-training run. Where it comes from (surprise & coding theory), why it punishes confident wrong predictions so brutally, and why it pairs so cleanly with softmax.
Boltzmann / Gibbs Distribution
The physics-born distribution that quietly powers softmax, the temperature knob in LLM sampling, attention weights, and every energy-based model. Where it comes from, what it means, and where it shows up in modern deep learning.
The Softmax Function
How a vector of arbitrary scores becomes a probability distribution — the formula, why the exponential, temperature, numerical stability, the gradient, and why softmax + cross-entropy is the standard classifier head.
Artificial Neural Networks — A Refresher
Neurons, activations, the forward pass, loss, gradient descent and backpropagation, optimisers, and the families of neural networks — a bridge from classical ANN theory to the modern deep-learning era.
System Thinking Approach to Data Analytics
Why analytics must be a tool that assists decision-making, not an end in itself — and a five-dimension framework for thinking about data inside a real organisation.
