Browse by series
Foundations
Core mathematics and concepts behind modern ML: neural-net fundamentals, the softmax + cross-entropy stack, and statistical inference. Each note stands alone; together they form the shared language of everything else in Data.
View the series →
Hardware & Compute
Where the maths meets the silicon. GPUs, CUDA, Apple's MPS/Metal/MLX stack, and what runs locally vs. needs a rented cloud GPU.
View the series →
Transformer Architectures
From the 2017 paper to today's LLMs. Self-attention and the QKV trio, the GPT decoder-only branch, RLHF and alignment, and the modern transformer anatomy you'd actually fine-tune.
View the series →
Every essay
MPS, Metal, MLX & CUDA
The GPU-compute landscape for machine learning: why GPUs, what CUDA / Metal / MLX each are, unified memory on Apple Silicon, and what runs locally on a Mac versus what needs a cloud GPU.
The Modern Transformer Anatomy
The models you actually fine-tune (Llama, Qwen) are decoder-only Transformers with five upgrades over the 2017 original: RoPE, RMSNorm, SwiGLU, GQA, and (sometimes) Mixture-of-Experts. Same core idea, much better engineering.
RLHF & Alignment
How a raw next-token predictor becomes a helpful assistant: the three-stage RLHF pipeline behind InstructGPT and ChatGPT, and the simpler DPO alternative that followed.
GPT, Decoder-Only Models & In-Context Learning
Why "predict the next token", scaled up, became the recipe for general-purpose AI. Causal attention, autoregressive generation, and the in-context learning that emerged with GPT-3.
What Came After the Transformer
The Transformer didn't lead to one next thing: it branched, then scaled, then got aligned. A map of the lineage from 2017 to today's LLMs, tying the key papers into a single timeline.
Attention Is All You Need
The 2017 paper that replaced recurrence with attention and made modern LLMs possible: self-attention, Q/K/V, multi-head, positional encoding, and the full encoder-decoder architecture.
Statistics & Inference
Sampling and practical sampling strategies, the Central Limit Theorem, hypothesis testing, and the z / t / chi-square tests: when to use each, what a p-value really means, and where it all applies.
Probability Concepts
Conditional probability, independence, the Law of Total Probability, and Bayes' Theorem, with the intuitive examples that make them stick, and why they sit at the heart of machine learning.
Cross-Entropy
The loss function behind virtually every classifier and every LLM pre-training run. Where it comes from (surprise & coding theory), why it punishes confident wrong predictions so brutally, and why it pairs so cleanly with softmax.
Boltzmann / Gibbs Distribution
The physics-born distribution that quietly powers softmax, the temperature knob in LLM sampling, attention weights, and every energy-based model. Where it comes from, what it means, and where it shows up in modern deep learning.
The Softmax Function
How a vector of arbitrary scores becomes a probability distribution: the formula, why the exponential, temperature, numerical stability, the gradient, and why softmax + cross-entropy is the standard classifier head.
Artificial Neural Networks: A Refresher
Neurons, activations, the forward pass, loss, gradient descent and backpropagation, optimisers, and the families of neural networks. A bridge from classical ANN theory to the modern deep-learning era.