Related notes: COMPSCI 714, Transformers, Mathematical Foundations, Build GPT-2 from Scratch, AI Systems, Soft Computing Explained
Roadmap
```mermaid
---
config:
  mindmap:
    padding: 18
    maxNodeWidth: 220
---
mindmap
  root((**Deep Learning**))
    **Probability and Statistics**
      Gaussian Distribution
      Expectation and Variance
      Entropy
      KL Divergence
      Bayes Theorem
    **Fundamentals**
      Linear Regression
      Loss Functions
        SE
        MSE
        Cross-Entropy
      Derivatives
        Gradient
        Chain Rule
      Gradient Descent
    **Linear Algebra**
      Vectors
        Dot Product
        Norm
        Cosine Similarity
      Matrices
        Matrix Multiply
        Transpose
        Broadcast
    **Neural Networks**
      Single Neuron
      MLP
        Forward Pass
        Backpropagation
      Activations
        ReLU
        Sigmoid
        Tanh
        Softmax
        GELU
      Normalisation
        Batch Norm
        Layer Norm
      Pathologies
        Vanishing Gradient
        Exploding Gradient
        Zigzag
      ResNet
        Skip Connection
      Training
        Dropout
        Adam
        L2 Regularisation
    **Architectures**
      CNN
        2D Convolution
        Pooling
        Feature Maps
      RNN
        Vanilla RNN
        LSTM
        GRU
      Attention
        Scaled Dot-Product
        Multi-Head
        Positional Encoding
      Transformer Block
        Encoder-Decoder
        BERT
          MLM
          GPT vs BERT
        ViT
    **Transfer Learning**
      Pre-train to Fine-tune
      LoRA
    **Generative AI**
      VAE
        ELBO
        Reparameterisation
      GAN
        Generator
        Discriminator
      Diffusion
        DDPM
        Forward Noise
        Reverse Denoise
      Autoregressive
        Temperature
        Top-p Sampling
      RLHF
        Reward Model
        PPO
        DPO
    **LLM**
      Architecture
        Tokenisation
          BPE
        Causal Attention
        Unembedding
        KV Cache
        MoE
      Perplexity
        PPL
      Scaling Laws
      Prompting
        k-shot
        Chain-of-Thought
        Self-Consistency
      RAG
    **Multimodal AI**
      CLIP
        InfoNCE Loss
      VLM
        Visual Encoder
        Projection
        Cross-Modal Attention
      Text-to-Image
        CFG
        Latent Diffusion
```
Probability & Statistics Foundations
Gaussian Distribution
The most important distribution in deep learning. A random variable $x \sim \mathcal{N}(\mu, \sigma^2)$ has density:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- $\mu$: mean (centre of the bell curve)
- $\sigma^2$: variance (spread); $\sigma$ is the standard deviation
Multivariate Gaussian $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
Expectation & Variance
- Expectation: $\mathbb{E}[X] = \sum_x x\,p(x)$ (discrete) or $\int x\,p(x)\,dx$ (continuous)
- Variance: $\mathrm{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
Entropy & KL Divergence
Entropy — measures the uncertainty of a distribution $p$:
$$H(p) = -\sum_x p(x) \log p(x)$$
KL Divergence — measures how much $q$ differs from $p$ (not a distance — asymmetric):
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
Cross-entropy loss is directly related: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, where $p$ is the true label distribution.
Bayes’ Theorem
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
- $p(\theta)$: prior — belief about parameters before seeing data
- $p(D \mid \theta)$: likelihood — how well $\theta$ explains the data
- $p(\theta \mid D)$: posterior — updated belief after seeing data
- $p(D)$: evidence — normalisation constant
MLE (Maximum Likelihood Estimation) maximises $\log p(D \mid \theta)$; MAP adds the prior as a regulariser: $\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta \left[\log p(D \mid \theta) + \log p(\theta)\right]$.
Essential ML Concepts, Mathematically
Linear Regression (LR)
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
In matrix form over $N$ samples: $\hat{\mathbf{y}} = X\mathbf{w} + b$, with $X \in \mathbb{R}^{N \times d}$.
Squared Error (SE)
$$\mathrm{SE} = (y - \hat{y})^2$$
Mean Squared Error (MSE)
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Cross-Entropy Loss (CE)
Used for classification. For a one-hot true label $\mathbf{y}$ and predicted probability vector $\hat{\mathbf{y}}$:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$
For binary classification ($C = 2$):
$$\mathcal{L} = -\left[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right]$$
Over $N$ samples (multiclass):
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$$
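The cross-entropy formula above can be checked numerically — a minimal pure-Python sketch (toy one-hot labels and probabilities of my own choosing):

```python
import math

def cross_entropy(y_true, y_pred):
    """L = -sum_c y_c * log(p_c) for one sample (y_true is one-hot)."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

# One-hot target: class 0. A confident correct prediction gives a small loss,
# a confident wrong prediction a large one.
loss_good = cross_entropy([1, 0, 0], [0.9, 0.05, 0.05])
loss_bad = cross_entropy([1, 0, 0], [0.05, 0.9, 0.05])
```

Only the term for the true class survives the sum, so the loss reduces to $-\log \hat{y}_{\text{true}}$.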
Derivatives
Derivative — instantaneous rate of change:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
Gradient — multivariate generalisation for $f : \mathbb{R}^n \to \mathbb{R}$:
$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)$$
Chain Rule — for $y = f(g(x))$:
$$\frac{dy}{dx} = f'(g(x))\, g'(x)$$
For deep compositions $y = f_L(f_{L-1}(\cdots f_1(x)))$, the derivative is the product of the layer-wise derivatives.
Gradient Descent — minimise $\mathcal{L}(\theta)$ by stepping opposite the gradient:
$$\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$$
Four-Step Process for Machine Learning
- Collect the data
- Define the model’s structure
- Define the loss function
- Minimise the loss
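The four steps can be run end-to-end on a toy 1-D linear regression — a pure-Python sketch with made-up data generated from $y = 2x + 1$:

```python
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0]]   # 1. collect the data
w, b = 0.0, 0.0                                          # 2. model: y_hat = w*x + b

def mse(w, b):                                           # 3. loss function
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

eta = 0.05                                               # 4. minimise by gradient descent
for _ in range(2000):
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    gb = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w, b = w - eta * gw, b - eta * gb
```

After enough steps the parameters recover the generating line ($w \approx 2$, $b \approx 1$).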
Vector is all you need
A vector $\mathbf{x} \in \mathbb{R}^d$ encodes a data point with $d$ features.
Operations:
- Add: $\mathbf{a} + \mathbf{b} = (a_1 + b_1, \ldots, a_d + b_d)$
- Dot product: $\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{d} a_i b_i$
- Norm: $\|\mathbf{a}\| = \sqrt{\sum_i a_i^2}$
- Cosine similarity: $\cos\theta = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\, \|\mathbf{b}\|}$
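The vector operations above in a few lines of pure Python (toy vectors chosen to be parallel, so cosine similarity should be exactly 1):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

a, b = [1.0, 2.0, 2.0], [2.0, 4.0, 4.0]   # b = 2a, i.e. parallel vectors
```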
Matrix
A matrix $A \in \mathbb{R}^{m \times n}$ represents a linear transformation $\mathbb{R}^n \to \mathbb{R}^m$.
Operations:
- Add: $(A + B)_{ij} = A_{ij} + B_{ij}$
- Mul (matrix product): $(AB)_{ij} = \sum_k A_{ik} B_{kj}$, for $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$
- Broadcast: scalar/vector ops extend across batch dimensions
- Dot product / inner product: $\mathbf{a}^\top \mathbf{b} = \sum_i a_i b_i$ (a $1 \times d$ by $d \times 1$ matrix product)
- Transpose: $(A^\top)_{ij} = A_{ji}$
Neural Network
Single neuron:
$$y = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$
MLP — layer-$l$ forward pass:
$$\mathbf{h}^{(l)} = \sigma\!\left(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)$$
Backpropagation (chain rule applied layer by layer):
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} \left(\mathbf{h}^{(l-1)}\right)^\top, \qquad \boldsymbol{\delta}^{(l)} = \left(W^{(l+1)\top} \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'\!\left(\mathbf{z}^{(l)}\right)$$
Activation:
ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Tanh: $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$
Softmax: $\mathrm{softmax}(\mathbf{z})_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$
GELU (Gaussian Error Linear Unit) — used in BERT, GPT, and all modern Transformers:
$$\mathrm{GELU}(x) = x\, \Phi(x)$$
where $\Phi$ is the standard Gaussian CDF. Practical approximation:
$$\mathrm{GELU}(x) \approx 0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\left(x + 0.044715\,x^3\right)\right]\right)$$
Unlike ReLU, GELU is smooth and non-zero for $x < 0$, which helps gradient flow.
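The activations above, including the GELU tanh approximation, as a pure-Python sketch:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def softmax(z):
    m = max(z)                        # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))
```

Note that `gelu(-1.0)` is small but negative, whereas `relu(-1.0)` is exactly zero — the "non-zero for $x < 0$" property mentioned above.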
Training detail
- Parameters $\theta = \{W^{(l)}, \mathbf{b}^{(l)}\}$: learned by gradient descent.
- Hyperparameters: learning rate $\eta$, batch size $B$, number of layers $L$, hidden dim $d$ — set before training.
$L_2$ Regularisation (weight decay):
$$\mathcal{L}' = \mathcal{L} + \lambda \|\theta\|_2^2$$
Dropout — randomly zero out neurons during training with probability $p$:
$$\tilde{\mathbf{h}} = \frac{\mathbf{m} \odot \mathbf{h}}{1 - p}, \qquad m_i \sim \mathrm{Bernoulli}(1 - p)$$
At inference, no dropout is applied (activations were already scaled by $\frac{1}{1-p}$ at train time — inverted dropout).
Batch Normalisation (BN) — normalise activations across the batch dimension, then scale and shift:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$
where $\mu_B, \sigma_B^2$ are batch mean/variance; $\gamma, \beta$ are learned parameters.
Layer Normalisation (LN) — same formula but normalise across the feature dimension (not batch). Preferred in Transformers because it is batch-size independent.
Adam optimiser — bias-corrected moment estimates:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
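The Adam update above, for a single scalar parameter, minimising the toy objective $f(x) = x^2$ (all names and the toy problem are mine):

```python
import math

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns the new state."""
    m = b1 * m + (1 - b1) * g              # first-moment EMA
    v = b2 * v + (1 - b2) * g * g          # second-moment EMA
    m_hat = m / (1 - b1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(x) = x^2, gradient 2x
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    x, m, v = adam_step(x, 2 * x, m, v, t, eta=0.01)
```

Because $\hat{m}_t / \sqrt{\hat{v}_t}$ has magnitude roughly 1, the very first step moves $\theta$ by approximately $\eta$ regardless of the raw gradient scale.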
Vanishing & Exploding Gradients
In a deep network with $L$ layers, the gradient flowing back to layer $l$ is a product of Jacobians:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}} \prod_{k=l+1}^{L} \frac{\partial \mathbf{h}^{(k)}}{\partial \mathbf{h}^{(k-1)}}$$
Each factor is $\frac{\partial \mathbf{h}^{(k)}}{\partial \mathbf{h}^{(k-1)}} = \mathrm{diag}\!\left(\sigma'(\mathbf{z}^{(k)})\right) W^{(k)}$. If the spectral norm is repeatedly $< 1$, gradients vanish exponentially; if $> 1$, they explode.
- Vanishing: gradients $\to 0$, early layers learn nothing. Caused by sigmoid/tanh saturation ($\sigma'(x) \le 0.25$ for sigmoid) + many layers.
- Exploding: parameter updates become huge, training diverges.
Fixes: ReLU (gradient $= 1$ in the positive region), residual connections, LayerNorm, gradient clipping:
$$\mathbf{g} \leftarrow \mathbf{g} \cdot \min\!\left(1, \frac{\tau}{\|\mathbf{g}\|}\right)$$
Zigzag in Gradient Descent
Vanilla SGD with a fixed learning rate $\eta$ oscillates (zigzags) when the loss surface has different curvatures along different directions — it overshoots in high-curvature directions and undershoots in low-curvature ones. If $\eta$ is large enough to make progress along the flat direction, it overshoots along the steep direction → zigzag trajectory.
Momentum damps oscillations by accumulating a velocity vector:
$$v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_t$$
The exponential moving average of gradients cancels out oscillating components while reinforcing consistent directions. Adam further applies per-parameter adaptive learning rates via $v_t$ (the second moment), which is why it usually converges faster than SGD on non-convex landscapes.
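This can be demonstrated on an anisotropic quadratic, $f(x, y) = \tfrac{1}{2}(100x^2 + y^2)$, steep in $x$ and flat in $y$ — a toy setup of my own, with $\eta$ chosen near the SGD stability limit so the zigzag is pronounced:

```python
def grad(p):
    x, y = p
    return (100 * x, y)   # gradient of 0.5 * (100*x^2 + y^2)

def run(eta, beta, steps=200):
    """Heavy-ball momentum GD; beta=0 reduces to plain SGD."""
    p, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = grad(p)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        p = [pi - eta * vi for pi, vi in zip(p, v)]
    return p

sgd = run(eta=0.019, beta=0.0)   # zigzags in x, crawls in y
mom = run(eta=0.019, beta=0.9)   # velocity averages out the x-oscillation
```

With the same learning rate, the momentum run ends much closer to the optimum at the origin.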
ResNet — Residual Connections
Add a skip connection that bypasses one or more layers, letting gradients flow directly to earlier layers:
$$\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$$
- $F(\mathbf{x})$: the residual to learn (e.g. two conv layers)
- $\mathbf{x}$: identity shortcut
Why it works: the gradient through the skip path is exactly $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + I$ — no matter how deep, the chain rule always has a direct path with gradient $1$. This prevents vanishing gradients in very deep networks (ResNet-152, Transformers).
CNN
2D convolution — kernel $K$ slides over image $I$:
$$(I * K)(i, j) = \sum_m \sum_n I(i+m,\, j+n)\, K(m, n)$$
Each layer applies $C_{\mathrm{out}}$ kernels → a feature map of shape $H' \times W' \times C_{\mathrm{out}}$.
Pooling — downsample spatial dimensions to reduce computation and add translation invariance (max pooling):
$$y_{i,j} = \max_{0 \le m, n < k} x_{si+m,\, sj+n}$$
where $k$ is the pool size and $s$ is the stride.
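The convolution formula above (technically cross-correlation, as deep-learning frameworks implement it) in pure Python, applied to a toy vertical-edge image of my own making:

```python
def conv2d(image, kernel):
    """'Valid' 2D cross-correlation over nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# 4x4 image: left half 0, right half 1; kernel responds to a left->right jump
img = [[0, 0, 1, 1]] * 4
k = [[-1, 1], [-1, 1]]
fmap = conv2d(img, k)
```

The feature map is non-zero only at the column where the intensity edge sits.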
RNN
Vanilla RNN — hidden state recurrence:
$$\mathbf{h}_t = \tanh(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b})$$
LSTM — gated memory cell $c_t$ with input/forget/output gates (biases omitted):
$$f_t = \sigma(W_f [\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad i_t = \sigma(W_i [\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad o_t = \sigma(W_o [\mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\tilde{c}_t = \tanh(W_c [\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad \mathbf{h}_t = o_t \odot \tanh(c_t)$$
GRU — simplified gating (reset $r_t$, update $z_t$):
$$z_t = \sigma(W_z [\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad r_t = \sigma(W_r [\mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\tilde{\mathbf{h}}_t = \tanh(W [r_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]), \quad \mathbf{h}_t = (1 - z_t) \odot \mathbf{h}_{t-1} + z_t \odot \tilde{\mathbf{h}}_t$$
Attention
Scaled Dot-Product Attention — with $Q = XW_Q$, $K = XW_K$, $V = XW_V$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Multi-Head Attention — $h$ parallel heads, then project:
$$\mathrm{MHA}(X) = \left[\mathrm{head}_1; \ldots; \mathrm{head}_h\right] W_O, \qquad \mathrm{head}_i = \mathrm{Attention}(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)})$$
Positional Encoding (sinusoidal):
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Transformer block (with residual + LayerNorm):
$$\mathbf{x}' = \mathrm{LN}(\mathbf{x} + \mathrm{MHA}(\mathbf{x})), \qquad \mathbf{y} = \mathrm{LN}(\mathbf{x}' + \mathrm{FFN}(\mathbf{x}'))$$
Language Modelling Objective — maximise log-likelihood over the token sequence:
$$\max_\theta \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
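Scaled dot-product attention for a single head, sketched in pure Python (toy $Q$, $K$, $V$ of my own, no masking):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)                      # attention weights sum to 1
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# A query aligned with the first key retrieves (mostly) the first value
Q = [[10.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```

The output is a weighted average of the value rows, dominated by the value whose key matches the query.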
ViT — split the image into patches of size $P \times P$, flatten each patch, linearly project to token embeddings, then apply a Transformer encoder.
Encoder-Decoder Transformer (original seq2seq, e.g. T5, BART) — the encoder processes the source sequence bidirectionally; the decoder generates target tokens autoregressively with cross-attention over encoder outputs.
Decoder block = (causal self-attention) → (cross-attention to encoder) → (FFN), each with residual + LayerNorm.
BERT — Bidirectional Encoder
BERT uses the encoder-only Transformer and pre-trains on two objectives:
1. Masked Language Model (MLM) — randomly mask 15 % of tokens, predict them from the full bidirectional context:
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p(x_i \mid \mathbf{x}_{\setminus M})$$
- $M$: set of masked positions
- $\mathbf{x}_{\setminus M}$: all tokens except the masked ones
Because attention is bidirectional (no causal mask), every token can attend to every other token — unlike GPT, which only sees past context.
2. Next Sentence Prediction (NSP) — binary classification: does sentence B follow sentence A?
GPT vs BERT comparison:
| | GPT (decoder-only) | BERT (encoder-only) |
|---|---|---|
| Attention | Causal (left-to-right) | Bidirectional (full) |
| Pre-training | Next-token prediction | MLM + NSP |
| Strength | Generation | Understanding / Classification |
| Representation | token $t$ sees only $x_{<t}$ | sees all tokens |
Transfer Learning
Pre-train on a large corpus to get $\theta_{\mathrm{pre}}$, then fine-tune on target task $\mathcal{T}$:
$$\theta^* = \arg\min_\theta \mathcal{L}_{\mathcal{T}}(\theta), \qquad \theta \text{ initialised at } \theta_{\mathrm{pre}}$$
LoRA — freeze $W$, inject a low-rank update ($r \ll \min(d, k)$):
$$W' = W + \Delta W = W + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}$$
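The low-rank update can be sketched in pure Python — a toy $4 \times 4$ frozen weight with a rank-1 update (all matrices here are made up; only $B$ and $A$ would be trained):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_weight(W, B, A, alpha=1.0):
    """Effective weight W' = W + alpha * (B @ A); W stays frozen."""
    delta = matmul(B, A)
    return [[w + alpha * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

# 4x4 identity as the frozen weight; rank r=1 means 4*1 + 1*4 = 8
# trainable parameters instead of 16.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[1.0], [0.0], [0.0], [0.0]]     # 4x1
A = [[0.0, 0.5, 0.0, 0.0]]           # 1x4
W_eff = lora_weight(W, B, A)
```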
Generative AI
Generative models learn the data distribution $p(x)$ and can sample new data from it.
The core distinction from discriminative models:
| | Discriminative | Generative |
|---|---|---|
| Goal | $p(y \mid x)$ or a decision boundary | $p(x)$ (or $p(x, y)$) |
| Output | Label / decision | New data sample |
| Examples | Classifier, Regression | VAE, GAN, Diffusion, LLM |
Variational Autoencoder (VAE)
Encode data $x$ into a latent variable $z$, then decode back.
Encoder — approximate posterior $q_\phi(z \mid x)$, parameterised as a Gaussian:
$$q_\phi(z \mid x) = \mathcal{N}\!\left(z;\, \mu_\phi(x),\, \sigma_\phi^2(x)\, I\right)$$
Decoder — likelihood $p_\theta(x \mid z)$.
ELBO objective (Evidence Lower Bound, maximise):
$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
KL divergence between the diagonal Gaussian posterior and the standard normal prior (closed form):
$$D_{\mathrm{KL}} = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
Reparameterisation trick — make sampling differentiable:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
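The reparameterised sample as a pure-Python sketch (toy 2-d latent; the point is that the randomness lives in $\epsilon$, so gradients can flow through $\mu$ and $\log\sigma^2$):

```python
import math
import random

def sample_z(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I); sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

random.seed(0)
# log_var = 0 means sigma = 1 in each dimension
zs = [sample_z([1.0, -1.0], [0.0, 0.0]) for _ in range(20000)]
mean0 = sum(z[0] for z in zs) / len(zs)   # should concentrate near mu[0] = 1.0
```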
Generative Adversarial Network (GAN)
Two networks compete: Generator $G$ tries to fool Discriminator $D$:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$
- $G(z)$: maps noise $z \sim p(z)$ to fake samples.
- $D(x)$: probability that $x$ is real.
- At equilibrium: $D(x) = \tfrac{1}{2}$ everywhere — the generator perfectly mimics the data.
Diffusion Models (DDPM)
Gradually add Gaussian noise over $T$ steps (forward process), then learn to reverse it (reverse process).
Forward process — fixed Markov chain:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\right)$$
Closed-form sampling at any step (let $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$):
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Reverse process — learned denoiser $p_\theta$:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)$$
Training objective — predict the noise:
$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$
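The closed-form forward sample for a scalar $x_0$, with a toy constant noise schedule of my own choosing:

```python
import math
import random

def forward_sample(x0, t, betas):
    """q(x_t | x_0): x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps."""
    abar = 1.0
    for s in range(t):
        abar *= 1.0 - betas[s]
    eps = random.gauss(0.0, 1.0)
    return math.sqrt(abar) * x0 + math.sqrt(1.0 - abar) * eps, abar

random.seed(0)
betas = [0.02] * 100                      # toy constant schedule (assumed)
_, abar_100 = forward_sample(1.0, 100, betas)
```

After 100 steps, $\bar{\alpha}_{100} = 0.98^{100} \approx 0.13$: most of the signal has been replaced by noise, which is exactly what the reverse process must learn to undo.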
Autoregressive Generation (LLM Decoding)
Given a prompt $x_{1:k}$, generate token-by-token:
$$x_t \sim p_\theta(x_t \mid x_{<t}), \qquad t = k+1, k+2, \ldots$$
Temperature scaling — control the sharpness of the distribution:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
- $T \to 0$: greedy (deterministic); $T = 1$: standard softmax; $T > 1$: more random.
Top-$p$ (nucleus) sampling — sample from the smallest set $S$ s.t.:
$$\sum_{i \in S} p_i \ge p$$
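Temperature and top-$p$ sampling as a pure-Python sketch (toy logits; the renormalisation over the kept set is implicit in sampling proportionally to the kept mass):

```python
import math
import random

def softmax_T(logits, T=1.0):
    m = max(logits)
    e = [math.exp((z - m) / T) for z in logits]
    s = sum(e)
    return [v / s for v in e]

def top_p_sample(probs, p=0.9):
    """Sample from the smallest set of tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    r = random.uniform(0, mass)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.1]
sharp = softmax_T(logits, T=0.1)   # low T -> near-greedy distribution
```

When the single most likely token already carries mass $\ge p$, top-$p$ collapses to greedy decoding.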
Reinforcement Learning from Human Feedback (RLHF)
Step 1 — Supervised Fine-Tuning (SFT): fine-tune the LLM on human demonstrations.
Step 2 — Reward Model: train $r_\phi$ from preference pairs $(y_w, y_l)$:
$$\mathcal{L}_{\mathrm{RM}} = -\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$$
Step 3 — PPO fine-tuning — maximise reward while staying close to the SFT policy $\pi_{\mathrm{SFT}}$:
$$\max_\theta \; \mathbb{E}_{y \sim \pi_\theta}\!\left[r_\phi(x, y)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{SFT}}\right)$$
DPO (Direct Preference Optimisation) — skips the reward model, optimises preferences directly:
$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$$
Large Language Models (LLM)
Architecture Overview
A decoder-only Transformer stacks $L$ blocks. Given token sequence $x_1, \ldots, x_T$:
- Tokenisation — map text to integer IDs via a vocabulary $V$ (typically tens of thousands of tokens).
- Embedding — $E \in \mathbb{R}^{|V| \times d}$; look up row $E_{x_t}$.
- $L$ Transformer blocks (causal masked attention + FFN).
- Unembedding — project to logits and apply softmax:
$$p(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}(\mathbf{h}_t W_U), \qquad W_U \in \mathbb{R}^{d \times |V|}$$
Causal (masked) self-attention — token $t$ can only attend to positions $\le t$:
$$M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}, \qquad \mathrm{Attention} = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V$$
Tokenisation & Embedding
Byte-Pair Encoding (BPE) — iteratively merge the most frequent adjacent symbol pair until the target vocabulary size is reached.
Token embedding + positional encoding → input representation:
$$\mathbf{h}_t^{(0)} = E_{x_t} + PE_t$$
RoPE (Rotary Positional Embedding) — encode position by rotating query/key vectors:
$$q_m' = R_{\Theta, m}\, q_m, \qquad k_n' = R_{\Theta, n}\, k_n$$
where $R_{\Theta, m}$ is a block-diagonal rotation matrix at angles $m\theta_i$.
Perplexity (PPL)
The standard intrinsic evaluation metric for language models — how “surprised” the model is by the test set:
$$\mathrm{PPL} = \exp\!\left(-\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\right)$$
- Lower PPL = model assigns higher probability to real text = better.
- PPL is simply the exponential of the cross-entropy loss: $\mathrm{PPL} = e^{\mathcal{L}_{\mathrm{CE}}}$.
- $\mathrm{PPL} = k$ means the model is as uncertain as choosing uniformly among $k$ tokens.
- GPT-2 (117M): PPL ≈ 35 on WikiText-103; GPT-4-class models: PPL in the single digits.
Bits-per-character (BPC) — alternative unit used for character-level models:
$$\mathrm{BPC} = -\frac{1}{N} \sum_{t=1}^{N} \log_2 p(c_t \mid c_{<t})$$
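The PPL formula in a few lines of Python, using made-up per-token probabilities to check the "uniform over $k$ tokens" interpretation:

```python
import math

def perplexity(token_probs):
    """PPL = exp(mean negative log-likelihood) over a token sequence."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Uniform uncertainty over a 50-token vocabulary -> PPL exactly 50
ppl_uniform = perplexity([1 / 50] * 10)
# A model assigning probability 0.5 to every real token -> PPL 2
ppl_better = perplexity([0.5] * 10)
```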
Scaling Laws
Model performance scales predictably with compute $C$, data $D$, and parameters $N$ (Chinchilla):
$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$
Optimal allocation for a compute budget $C \approx 6ND$:
$$N^* \propto C^{0.5}, \qquad D^* \propto C^{0.5}$$
i.e. tokens and parameters should scale equally.
In-Context Learning (ICL) & Prompting
LLMs can learn from examples given in the prompt, without updating any weights.
$k$-shot prompting — prepend $k$ (input, output) examples:
$$\text{prompt} = (x_1, y_1), \ldots, (x_k, y_k),\; x_{\mathrm{test}}$$
Chain-of-Thought (CoT) — include reasoning steps $z$ before the answer $y$:
$$p(y \mid x) = \sum_z p(y \mid z, x)\, p(z \mid x)$$
Self-consistency — sample $m$ reasoning paths, take the majority vote:
$$\hat{y} = \mathrm{mode}\!\left(y^{(1)}, \ldots, y^{(m)}\right)$$
Retrieval-Augmented Generation (RAG)
Augment generation with retrieved documents $d_1, \ldots, d_k$ from an external knowledge base:
$$p(y \mid x) = \prod_t p_\theta\!\left(y_t \mid y_{<t},\, x,\, d_1, \ldots, d_k\right)$$
Retrieval — encode the query and documents, fetch the top-$k$ by cosine similarity:
$$\mathrm{score}(q, d) = \frac{E_q(q) \cdot E_d(d)}{\|E_q(q)\|\, \|E_d(d)\|}$$
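The retrieval step sketched in pure Python, with toy 3-d "embeddings" standing in for real encoder outputs:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the top-k documents by cosine similarity."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: -cosine(query_vec, doc_vecs[i]))
    return ranked[:k]

# Toy document embeddings (assumed to come from the same encoder as the query)
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
top = retrieve([1.0, 0.05, 0.0], docs, k=2)
```

The query is nearly parallel to documents 0 and 2, so those are retrieved; document 1 is orthogonal and ranks last.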
KV Cache
During autoregressive inference, the keys and values for all past tokens are cached to avoid recomputing them for each new token:
$$K_{1:t} = [K_{1:t-1};\, k_t], \qquad V_{1:t} = [V_{1:t-1};\, v_t]$$
The new token only computes $q_t, k_t, v_t$, then attends over the cached $K_{1:t}, V_{1:t}$. This reduces per-step attention cost from $O(t^2)$ to $O(t)$; cache memory grows linearly with sequence length.
Mixture of Experts (MoE)
Replace the dense FFN in each Transformer block with $E$ expert FFNs. A learned router selects the top-$k$ experts per token:
$$\mathbf{y} = \sum_{i \in \mathrm{TopK}(g(\mathbf{x}))} g_i(\mathbf{x})\, \mathrm{FFN}_i(\mathbf{x}), \qquad g(\mathbf{x}) = \mathrm{softmax}(W_r \mathbf{x})$$
- Activated params per token: roughly $k/E$ of the total → same inference cost as a smaller dense model.
- Load balancing loss encourages uniform expert utilisation: $\mathcal{L}_{\mathrm{aux}} = E \sum_{i=1}^{E} f_i P_i$, where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the mean router probability for expert $i$.
Used in Mixtral, Switch Transformer, and (reportedly) GPT-4.
Key LLM Concepts Summary
| Concept | What it does | Key formula / idea |
|---|---|---|
| Tokenisation (BPE) | Text → integer IDs | Merge most-frequent pairs |
| Causal Attention | Each token sees only the past | Mask $M_{ij} = -\infty$ for $j > i$ |
| Scaling Law | Predict loss from $N, D$ | $L = E + A/N^\alpha + B/D^\beta$ |
| SFT | Align model to instructions | Cross-entropy on demonstrations |
| RLHF / DPO | Align to human preferences | Reward signal or preference pairs |
| CoT Prompting | Elicit step-by-step reasoning | $p(y \mid x) = \sum_z p(y \mid z, x)\, p(z \mid x)$ |
| RAG | Ground generation in facts | Retrieve then generate |
| LoRA | Parameter-efficient fine-tuning | $W' = W + BA$, $r \ll \min(d, k)$ |
Multimodal AI
Multimodal models process and generate more than one modality (text, image, audio, video) within a unified framework.
Modality Encoding
Each modality is first encoded into a shared embedding space $\mathbb{R}^d$:
| Modality | Encoder | Output |
|---|---|---|
| Text | Tokeniser + Embedding | $Z \in \mathbb{R}^{T \times d}$ |
| Image | ViT / CNN patch encoder | $Z \in \mathbb{R}^{n \times d}$ |
| Audio | Spectrogram + Conv / Whisper | $Z \in \mathbb{R}^{T' \times d}$ |
| Video | Frame-level ViT + temporal attention | $Z \in \mathbb{R}^{F \times n \times d}$ |
CLIP — Contrastive Vision-Language Pre-training
Learn aligned image and text embeddings by maximising agreement between matched pairs.
Given a batch of $N$ (image $I_i$, text $T_i$) pairs, compute cosine similarities scaled by temperature $\tau$:
$$s_{ij} = \frac{f(I_i) \cdot g(T_j)}{\|f(I_i)\|\, \|g(T_j)\|} \cdot \frac{1}{\tau}$$
Symmetric InfoNCE loss (maximise the diagonal, minimise the off-diagonal):
$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right), \qquad \mathcal{L}_{I \to T} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s_{ii}}}{\sum_{j=1}^{N} e^{s_{ij}}}$$
At test time: zero-shot classification by picking the text prompt with highest cosine similarity to the image.
Vision-Language Models (VLM)
Connect a vision encoder to an LLM via a projection layer.
Architecture:
$$\text{image} \xrightarrow{\text{vision encoder}} \mathbf{z}_v \xrightarrow{\;W_p\;} \text{visual tokens} \rightarrow \text{LLM}$$
Visual tokens are prepended (or interleaved) with text tokens and fed into the LLM.
LLaVA-style training — two stages:
- Pre-train the projection only (freeze encoder + LLM): learn $W_p$
- Instruction fine-tuning: unfreeze the LLM, train on (image, instruction, response) triplets
Objective — standard next-token prediction on response tokens only:
$$\mathcal{L} = -\sum_t \log p_\theta\!\left(y_t \mid y_{<t},\, \mathbf{v},\, x\right)$$
Cross-Modal Attention
Allow one modality to attend over another. Text queries attend to visual keys/values:
$$\mathrm{Attention}(Q_{\mathrm{text}},\, K_{\mathrm{img}},\, V_{\mathrm{img}}) = \mathrm{softmax}\!\left(\frac{Q_{\mathrm{text}} K_{\mathrm{img}}^\top}{\sqrt{d_k}}\right) V_{\mathrm{img}}$$
Used in Flamingo, Perceiver Resampler, etc.
Image Generation — Text-to-Image
Condition a diffusion model on a text embedding $c = \tau_\theta(\text{prompt})$.
Classifier-Free Guidance (CFG) — blend the conditional and unconditional score:
$$\tilde{\epsilon} = \epsilon_\theta(x_t, \varnothing) + w\left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right)$$
- $w > 1$: stronger text conditioning (higher fidelity to the prompt, less diversity).
- $w = 1$: standard conditional generation.
- $w = 0$: unconditional generation.
Latent Diffusion (Stable Diffusion) — run diffusion in a compressed latent space:
$$z = \mathcal{E}(x), \qquad \text{diffuse in latent space}, \qquad \hat{x} = \mathcal{D}(z_0)$$
Multimodal Summary
| Model / Concept | Modalities | Key idea |
|---|---|---|
| CLIP | Image + Text | Contrastive alignment, InfoNCE |
| LLaVA / InternVL | Image + Text | Visual tokens → LLM via projection |
| Flamingo | Image + Text | Cross-modal attention layers |
| Stable Diffusion | Text → Image | Latent diffusion + CFG |
| Whisper | Audio → Text | Spectrogram encoder + decoder |
| GPT-4o | Image/Audio/Text | Unified multimodal autoregressive LLM |