Related notes: COMPSCI 714, Transformers, Mathematical Foundations, Build GPT-2 from Scratch, AI Systems, Soft Computing Explained
Roadmap
```mermaid
---
config:
  mindmap:
    padding: 18
    maxNodeWidth: 220
---
mindmap
  root((**Deep Learning**))
    **Probability and Statistics**
      Gaussian Distribution
      Expectation and Variance
      Entropy
      KL Divergence
      Bayes Theorem
    **Fundamentals**
      Linear Regression
      Loss Functions
        SE
        MSE
        Cross-Entropy
      Derivatives
        Gradient
        Chain Rule
      Gradient Descent
    **Linear Algebra**
      Vectors
        Dot Product
        Norm
        Cosine Similarity
      Matrices
        Matrix Multiply
        Transpose
        Broadcast
    **Neural Networks**
      Single Neuron
      MLP
        Forward Pass
        Backpropagation
      Activations
        ReLU
        Sigmoid
        Tanh
        Softmax
        GELU
      Normalisation
        Batch Norm
        Layer Norm
      Pathologies
        Vanishing Gradient
        Exploding Gradient
        Zigzag
      ResNet
        Skip Connection
      Training
        Dropout
        Adam
        L2 Regularisation
    **Architectures**
      CNN
        2D Convolution
        Pooling
        Feature Maps
      RNN
        Vanilla RNN
        LSTM
        GRU
      Attention
        Scaled Dot-Product
        Multi-Head
        Positional Encoding
      Transformer Block
        Encoder-Decoder
        BERT
          MLM
          GPT vs BERT
        ViT
    **Transfer Learning**
      Pre-train to Fine-tune
      LoRA
    **Generative AI**
      VAE
        ELBO
        Reparameterisation
      GAN
        Generator
        Discriminator
      Diffusion
        DDPM
        Forward Noise
        Reverse Denoise
      Autoregressive
        Temperature
        Top-p Sampling
      RLHF
        Reward Model
        PPO
        DPO
    **LLM**
      Architecture
        Tokenisation
          BPE
        Causal Attention
        Unembedding
        KV Cache
        MoE
      Perplexity
        PPL
      Scaling Laws
      Prompting
        k-shot
        Chain-of-Thought
        Self-Consistency
      RAG
    **Multimodal AI**
      CLIP
        InfoNCE Loss
      VLM
        Visual Encoder
        Projection
        Cross-Modal Attention
      Text-to-Image
        CFG
        Latent Diffusion
```
Probability & Statistics Foundations
Gaussian Distribution
The most important distribution in deep learning. A random variable $x \sim \mathcal{N}(\mu, \sigma^2)$ has density:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- $\mu$: mean (centre of the bell curve)
- $\sigma^2$: variance (spread); $\sigma$ is the standard deviation
Multivariate Gaussian $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
Expectation & Variance
- Expectation: $\mathbb{E}[X] = \sum_x x\,p(x)$ (discrete) or $\int x\,p(x)\,dx$ (continuous)
- Variance: $\mathrm{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
Entropy & KL Divergence
Entropy — measures the uncertainty of a distribution $p$:
$$H(p) = -\sum_x p(x) \log p(x)$$
KL Divergence — measures how much $q$ differs from $p$ (not a distance — asymmetric):
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
Cross-entropy loss is directly related: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, where $p$ is the true label distribution.
Bayes’ Theorem
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
- $p(\theta)$: prior — belief about parameters before seeing data
- $p(D \mid \theta)$: likelihood — how well $\theta$ explains the data
- $p(\theta \mid D)$: posterior — updated belief after seeing data
- $p(D)$: evidence — normalisation constant
MLE (Maximum Likelihood Estimation) maximises $\log p(D \mid \theta)$; MAP adds the prior as a regulariser: $\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta \left[\log p(D \mid \theta) + \log p(\theta)\right]$.
Essential ML Concepts, Mathematically
Linear Regression (LR)
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
In matrix form over $N$ samples: $\hat{\mathbf{y}} = X\mathbf{w} + b$, with $X \in \mathbb{R}^{N \times d}$.
Squared Error (SE)
$$\mathrm{SE} = (y - \hat{y})^2$$
Mean Squared Error (MSE)
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Cross-Entropy Loss (CE)
Used for classification. For a one-hot true label $\mathbf{y}$ and predicted probability vector $\hat{\mathbf{y}}$:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$
For binary classification ($C = 2$):
$$\mathcal{L} = -\left[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right]$$
Over $N$ samples (multiclass):
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$$
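The cross-entropy formula above can be checked numerically — a minimal pure-Python sketch (toy one-hot labels and probabilities of my own choosing):

```python
import math

def cross_entropy(y_true, y_pred):
    """L = -sum_c y_c * log(p_c) for one sample (y_true is one-hot)."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

# One-hot target: class 0. A confident correct prediction gives a small loss,
# a confident wrong prediction a large one.
loss_good = cross_entropy([1, 0, 0], [0.9, 0.05, 0.05])
loss_bad = cross_entropy([1, 0, 0], [0.05, 0.9, 0.05])
```

Only the term for the true class survives the sum, so the loss reduces to $-\log \hat{y}_{\text{true}}$.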
Derivatives
Derivative — instantaneous rate of change:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
Gradient — multivariate generalisation for $f : \mathbb{R}^n \to \mathbb{R}$:
$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)$$
Chain Rule — for $y = f(g(x))$:
$$\frac{dy}{dx} = f'(g(x))\, g'(x)$$
For deep compositions $y = f_L(f_{L-1}(\cdots f_1(x)))$, the derivative is the product of the layer-wise derivatives.
Gradient Descent — minimise $\mathcal{L}(\theta)$ by stepping opposite the gradient:
$$\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$$
Four-Step Process for Machine Learning
- Collect the data
- Define the model’s structure
- Define the loss function
- Minimise the loss
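The four steps can be run end-to-end on a toy 1-D linear regression — a pure-Python sketch with made-up data generated from $y = 2x + 1$:

```python
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0]]   # 1. collect the data
w, b = 0.0, 0.0                                          # 2. model: y_hat = w*x + b

def mse(w, b):                                           # 3. loss function
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

eta = 0.05                                               # 4. minimise by gradient descent
for _ in range(2000):
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    gb = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w, b = w - eta * gw, b - eta * gb
```

After enough steps the parameters recover the generating line ($w \approx 2$, $b \approx 1$).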
Vector is all you need
A vector $\mathbf{x} \in \mathbb{R}^d$ encodes a data point with $d$ features.
Operations:
- Add: $\mathbf{a} + \mathbf{b} = (a_1 + b_1, \ldots, a_d + b_d)$
- Dot product: $\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{d} a_i b_i$
- Norm: $\|\mathbf{a}\| = \sqrt{\sum_i a_i^2}$
- Cosine similarity: $\cos\theta = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\, \|\mathbf{b}\|}$
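The vector operations above in a few lines of pure Python (toy vectors chosen to be parallel, so cosine similarity should be exactly 1):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

a, b = [1.0, 2.0, 2.0], [2.0, 4.0, 4.0]   # b = 2a, i.e. parallel vectors
```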
Matrix
A matrix $A \in \mathbb{R}^{m \times n}$ represents a linear transformation $\mathbb{R}^n \to \mathbb{R}^m$.
Operations:
- Add: $(A + B)_{ij} = A_{ij} + B_{ij}$
- Mul (matrix product): $(AB)_{ij} = \sum_k A_{ik} B_{kj}$, for $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$
- Broadcast: scalar/vector ops extend across batch dimensions
- Dot product / inner product: $\mathbf{a}^\top \mathbf{b} = \sum_i a_i b_i$ (a $1 \times d$ by $d \times 1$ matrix product)
- Transpose: $(A^\top)_{ij} = A_{ji}$
Neural Network
Single neuron:
$$y = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$
MLP — layer-$l$ forward pass:
$$\mathbf{h}^{(l)} = \sigma\!\left(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)$$
Backpropagation (chain rule applied layer by layer):
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} \left(\mathbf{h}^{(l-1)}\right)^\top, \qquad \boldsymbol{\delta}^{(l)} = \left(W^{(l+1)\top} \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'\!\left(\mathbf{z}^{(l)}\right)$$
Activation:
ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Tanh: $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$
Softmax: $\mathrm{softmax}(\mathbf{z})_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$
GELU (Gaussian Error Linear Unit) — used in BERT, GPT, and all modern Transformers:
$$\mathrm{GELU}(x) = x\, \Phi(x)$$
where $\Phi$ is the standard Gaussian CDF. Practical approximation:
$$\mathrm{GELU}(x) \approx 0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\left(x + 0.044715\,x^3\right)\right]\right)$$
Unlike ReLU, GELU is smooth and non-zero for $x < 0$, which helps gradient flow.
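The activations above, including the GELU tanh approximation, as a pure-Python sketch:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def softmax(z):
    m = max(z)                        # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))
```

Note that `gelu(-1.0)` is small but negative, whereas `relu(-1.0)` is exactly zero — the "non-zero for $x < 0$" property mentioned above.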
Training detail
- Parameters $\theta = \{W^{(l)}, \mathbf{b}^{(l)}\}$: learned by gradient descent.
- Hyperparameters: learning rate $\eta$, batch size $B$, number of layers $L$, hidden dim $d$ — set before training.
$L_2$ Regularisation (weight decay):
$$\mathcal{L}' = \mathcal{L} + \lambda \|\theta\|_2^2$$
Dropout — randomly zero out neurons during training with probability $p$:
$$\tilde{\mathbf{h}} = \frac{\mathbf{m} \odot \mathbf{h}}{1 - p}, \qquad m_i \sim \mathrm{Bernoulli}(1 - p)$$
At inference, no dropout is applied (activations were already scaled by $\frac{1}{1-p}$ at train time — inverted dropout).
Batch Normalisation (BN) — normalise activations across the batch dimension, then scale and shift:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$
where $\mu_B, \sigma_B^2$ are batch mean/variance; $\gamma, \beta$ are learned parameters.
Layer Normalisation (LN) — same formula but normalise across the feature dimension (not batch). Preferred in Transformers because it is batch-size independent.
Adam optimiser — bias-corrected moment estimates:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
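The Adam update above, for a single scalar parameter, minimising the toy objective $f(x) = x^2$ (all names and the toy problem are mine):

```python
import math

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns the new state."""
    m = b1 * m + (1 - b1) * g              # first-moment EMA
    v = b2 * v + (1 - b2) * g * g          # second-moment EMA
    m_hat = m / (1 - b1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(x) = x^2, gradient 2x
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    x, m, v = adam_step(x, 2 * x, m, v, t, eta=0.01)
```

Because $\hat{m}_t / \sqrt{\hat{v}_t}$ has magnitude roughly 1, the very first step moves $\theta$ by approximately $\eta$ regardless of the raw gradient scale.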
Vanishing & Exploding Gradients
In a deep network with $L$ layers, the gradient flowing back to layer $l$ is a product of Jacobians:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}} \prod_{k=l+1}^{L} \frac{\partial \mathbf{h}^{(k)}}{\partial \mathbf{h}^{(k-1)}}$$
Each factor is $\frac{\partial \mathbf{h}^{(k)}}{\partial \mathbf{h}^{(k-1)}} = \mathrm{diag}\!\left(\sigma'(\mathbf{z}^{(k)})\right) W^{(k)}$. If the spectral norm is repeatedly $< 1$, gradients vanish exponentially; if $> 1$, they explode.
- Vanishing: gradients $\to 0$, early layers learn nothing. Caused by sigmoid/tanh saturation ($\sigma'(x) \le 0.25$ for sigmoid) + many layers.
- Exploding: parameter updates become huge, training diverges.
Fixes: ReLU (gradient $= 1$ in the positive region), residual connections, LayerNorm, gradient clipping:
$$\mathbf{g} \leftarrow \mathbf{g} \cdot \min\!\left(1, \frac{\tau}{\|\mathbf{g}\|}\right)$$
Zigzag in Gradient Descent
Vanilla SGD with a fixed learning rate $\eta$ oscillates (zigzags) when the loss surface has different curvatures along different directions — it overshoots in high-curvature directions and undershoots in low-curvature ones. If $\eta$ is large enough to make progress along the flat direction, it overshoots along the steep direction → zigzag trajectory.
Momentum damps oscillations by accumulating a velocity vector:
$$v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_t$$
The exponential moving average of gradients cancels out oscillating components while reinforcing consistent directions. Adam further applies per-parameter adaptive learning rates via $v_t$ (the second moment), which is why it usually converges faster than SGD on non-convex landscapes.
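This can be demonstrated on an anisotropic quadratic, $f(x, y) = \tfrac{1}{2}(100x^2 + y^2)$, steep in $x$ and flat in $y$ — a toy setup of my own, with $\eta$ chosen near the SGD stability limit so the zigzag is pronounced:

```python
def grad(p):
    x, y = p
    return (100 * x, y)   # gradient of 0.5 * (100*x^2 + y^2)

def run(eta, beta, steps=200):
    """Heavy-ball momentum GD; beta=0 reduces to plain SGD."""
    p, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = grad(p)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        p = [pi - eta * vi for pi, vi in zip(p, v)]
    return p

sgd = run(eta=0.019, beta=0.0)   # zigzags in x, crawls in y
mom = run(eta=0.019, beta=0.9)   # velocity averages out the x-oscillation
```

With the same learning rate, the momentum run ends much closer to the optimum at the origin.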
ResNet — Residual Connections
Add a skip connection that bypasses one or more layers, letting gradients flow directly to earlier layers:
$$\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$$
- $F(\mathbf{x})$: the residual to learn (e.g. two conv layers)
- $\mathbf{x}$: identity shortcut
Why it works: the gradient through the skip path is exactly $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + I$ — no matter how deep, the chain rule always has a direct path with gradient $1$. This prevents vanishing gradients in very deep networks (ResNet-152, Transformers).
CNN
2D convolution — kernel $K$ slides over image $I$:
$$(I * K)(i, j) = \sum_m \sum_n I(i+m,\, j+n)\, K(m, n)$$
Each layer applies $C_{\mathrm{out}}$ kernels → a feature map of shape $H' \times W' \times C_{\mathrm{out}}$.
Pooling — downsample spatial dimensions to reduce computation and add translation invariance (max pooling):
$$y_{i,j} = \max_{0 \le m, n < k} x_{si+m,\, sj+n}$$
where $k$ is the pool size and $s$ is the stride.
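The convolution formula above (technically cross-correlation, as deep-learning frameworks implement it) in pure Python, applied to a toy vertical-edge image of my own making:

```python
def conv2d(image, kernel):
    """'Valid' 2D cross-correlation over nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# 4x4 image: left half 0, right half 1; kernel responds to a left->right jump
img = [[0, 0, 1, 1]] * 4
k = [[-1, 1], [-1, 1]]
fmap = conv2d(img, k)
```

The feature map is non-zero only at the column where the intensity edge sits.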
RNN
Vanilla RNN — hidden state recurrence:
$$\mathbf{h}_t = \tanh(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b})$$
LSTM — gated memory cell $c_t$ with input/forget/output gates (biases omitted):
$$f_t = \sigma(W_f [\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad i_t = \sigma(W_i [\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad o_t = \sigma(W_o [\mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\tilde{c}_t = \tanh(W_c [\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad \mathbf{h}_t = o_t \odot \tanh(c_t)$$
GRU — simplified gating (reset $r_t$, update $z_t$):
$$z_t = \sigma(W_z [\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad r_t = \sigma(W_r [\mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\tilde{\mathbf{h}}_t = \tanh(W [r_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]), \quad \mathbf{h}_t = (1 - z_t) \odot \mathbf{h}_{t-1} + z_t \odot \tilde{\mathbf{h}}_t$$
Attention
Scaled Dot-Product Attention — with $Q = XW_Q$, $K = XW_K$, $V = XW_V$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Multi-Head Attention — $h$ parallel heads, then project:
$$\mathrm{MHA}(X) = \left[\mathrm{head}_1; \ldots; \mathrm{head}_h\right] W_O, \qquad \mathrm{head}_i = \mathrm{Attention}(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)})$$
Positional Encoding (sinusoidal):
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Transformer block (with residual + LayerNorm):
$$\mathbf{x}' = \mathrm{LN}(\mathbf{x} + \mathrm{MHA}(\mathbf{x})), \qquad \mathbf{y} = \mathrm{LN}(\mathbf{x}' + \mathrm{FFN}(\mathbf{x}'))$$
Language Modelling Objective — maximise log-likelihood over the token sequence:
$$\max_\theta \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
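Scaled dot-product attention for a single head, sketched in pure Python (toy $Q$, $K$, $V$ of my own, no masking):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)                      # attention weights sum to 1
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# A query aligned with the first key retrieves (mostly) the first value
Q = [[10.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```

The output is a weighted average of the value rows, dominated by the value whose key matches the query.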
ViT — split the image into patches of size $P \times P$, flatten each patch, linearly project to token embeddings, then apply a Transformer encoder.
Encoder-Decoder Transformer (original seq2seq, e.g. T5, BART) — the encoder processes the source sequence bidirectionally; the decoder generates target tokens autoregressively with cross-attention over encoder outputs.
Decoder block = (causal self-attention) → (cross-attention to encoder) → (FFN), each with residual + LayerNorm.
BERT — Bidirectional Encoder
BERT uses the encoder-only Transformer and pre-trains on two objectives:
1. Masked Language Model (MLM) — randomly mask 15 % of tokens, predict them from the full bidirectional context:
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p(x_i \mid \mathbf{x}_{\setminus M})$$
- $M$: set of masked positions
- $\mathbf{x}_{\setminus M}$: all tokens except the masked ones
Because attention is bidirectional (no causal mask), every token can attend to every other token — unlike GPT, which only sees past context.
2. Next Sentence Prediction (NSP) — binary classification: does sentence B follow sentence A?
GPT vs BERT comparison:
| | GPT (decoder-only) | BERT (encoder-only) |
|---|---|---|
| Attention | Causal (left-to-right) | Bidirectional (full) |
| Pre-training | Next-token prediction | MLM + NSP |
| Strength | Generation | Understanding / Classification |
| Representation | token $t$ sees only $x_{<t}$ | sees all tokens |
Transfer Learning
Pre-train on a large corpus to get $\theta_{\mathrm{pre}}$, then fine-tune on target task $\mathcal{T}$:
$$\theta^* = \arg\min_\theta \mathcal{L}_{\mathcal{T}}(\theta), \qquad \theta \text{ initialised at } \theta_{\mathrm{pre}}$$
LoRA — freeze $W$, inject a low-rank update ($r \ll \min(d, k)$):
$$W' = W + \Delta W = W + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}$$
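The low-rank update can be sketched in pure Python — a toy $4 \times 4$ frozen weight with a rank-1 update (all matrices here are made up; only $B$ and $A$ would be trained):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_weight(W, B, A, alpha=1.0):
    """Effective weight W' = W + alpha * (B @ A); W stays frozen."""
    delta = matmul(B, A)
    return [[w + alpha * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

# 4x4 identity as the frozen weight; rank r=1 means 4*1 + 1*4 = 8
# trainable parameters instead of 16.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[1.0], [0.0], [0.0], [0.0]]     # 4x1
A = [[0.0, 0.5, 0.0, 0.0]]           # 1x4
W_eff = lora_weight(W, B, A)
```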
Generative AI
Generative models learn the data distribution $p(x)$ and can sample new data from it.
The core distinction from discriminative models:
| | Discriminative | Generative |
|---|---|---|
| Goal | $p(y \mid x)$ or a decision boundary | $p(x)$ (or $p(x, y)$) |
| Output | Label / decision | New data sample |
| Examples | Classifier, Regression | VAE, GAN, Diffusion, LLM |
Variational Autoencoder (VAE)
Encode data $x$ into a latent variable $z$, then decode back.
Encoder — approximate posterior $q_\phi(z \mid x)$, parameterised as a Gaussian:
$$q_\phi(z \mid x) = \mathcal{N}\!\left(z;\, \mu_\phi(x),\, \sigma_\phi^2(x)\, I\right)$$
Decoder — likelihood $p_\theta(x \mid z)$.
ELBO objective (Evidence Lower Bound, maximise):
$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
KL divergence between the diagonal Gaussian posterior and the standard normal prior (closed form):
$$D_{\mathrm{KL}} = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
Reparameterisation trick — make sampling differentiable:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
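The reparameterised sample as a pure-Python sketch (toy 2-d latent; the point is that the randomness lives in $\epsilon$, so gradients can flow through $\mu$ and $\log\sigma^2$):

```python
import math
import random

def sample_z(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I); sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

random.seed(0)
# log_var = 0 means sigma = 1 in each dimension
zs = [sample_z([1.0, -1.0], [0.0, 0.0]) for _ in range(20000)]
mean0 = sum(z[0] for z in zs) / len(zs)   # should concentrate near mu[0] = 1.0
```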
Generative Adversarial Network (GAN)
Two networks compete: Generator $G$ tries to fool Discriminator $D$:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$
- $G(z)$: maps noise $z \sim p(z)$ to fake samples.
- $D(x)$: probability that $x$ is real.
- At equilibrium: $D(x) = \tfrac{1}{2}$ everywhere — the generator perfectly mimics the data.
Diffusion Models (DDPM)
Gradually add Gaussian noise over $T$ steps (forward process), then learn to reverse it (reverse process).
Forward process — fixed Markov chain:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\right)$$
Closed-form sampling at any step (let $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$):
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Reverse process — learned denoiser $p_\theta$:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)$$
Training objective — predict the noise:
$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$
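The closed-form forward sample for a scalar $x_0$, with a toy constant noise schedule of my own choosing:

```python
import math
import random

def forward_sample(x0, t, betas):
    """q(x_t | x_0): x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps."""
    abar = 1.0
    for s in range(t):
        abar *= 1.0 - betas[s]
    eps = random.gauss(0.0, 1.0)
    return math.sqrt(abar) * x0 + math.sqrt(1.0 - abar) * eps, abar

random.seed(0)
betas = [0.02] * 100                      # toy constant schedule (assumed)
_, abar_100 = forward_sample(1.0, 100, betas)
```

After 100 steps, $\bar{\alpha}_{100} = 0.98^{100} \approx 0.13$: most of the signal has been replaced by noise, which is exactly what the reverse process must learn to undo.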
Autoregressive Generation (LLM Decoding)
Given a prompt $x_{1:k}$, generate token-by-token:
$$x_t \sim p_\theta(x_t \mid x_{<t}), \qquad t = k+1, k+2, \ldots$$
Temperature scaling — control the sharpness of the distribution:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
- $T \to 0$: greedy (deterministic); $T = 1$: standard softmax; $T > 1$: more random.
Top-$p$ (nucleus) sampling — sample from the smallest set $S$ s.t.:
$$\sum_{i \in S} p_i \ge p$$
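Temperature and top-$p$ sampling as a pure-Python sketch (toy logits; the renormalisation over the kept set is implicit in sampling proportionally to the kept mass):

```python
import math
import random

def softmax_T(logits, T=1.0):
    m = max(logits)
    e = [math.exp((z - m) / T) for z in logits]
    s = sum(e)
    return [v / s for v in e]

def top_p_sample(probs, p=0.9):
    """Sample from the smallest set of tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    r = random.uniform(0, mass)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.1]
sharp = softmax_T(logits, T=0.1)   # low T -> near-greedy distribution
```

When the single most likely token already carries mass $\ge p$, top-$p$ collapses to greedy decoding.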
Reinforcement Learning from Human Feedback (RLHF)
Step 1 — Supervised Fine-Tuning (SFT): fine-tune the LLM on human demonstrations.
Step 2 — Reward Model: train $r_\phi$ from preference pairs $(y_w, y_l)$:
$$\mathcal{L}_{\mathrm{RM}} = -\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$$
Step 3 — PPO fine-tuning — maximise reward while staying close to the SFT policy $\pi_{\mathrm{SFT}}$:
$$\max_\theta \; \mathbb{E}_{y \sim \pi_\theta}\!\left[r_\phi(x, y)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{SFT}}\right)$$
DPO (Direct Preference Optimisation) — skips the reward model, optimises preferences directly:
$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$$
Large Language Models (LLM)
Architecture Overview
A decoder-only Transformer stacks $L$ blocks. Given token sequence $x_1, \ldots, x_T$:
- Tokenisation — map text to integer IDs via a vocabulary $V$ (typically tens of thousands of tokens).
- Embedding — $E \in \mathbb{R}^{|V| \times d}$; look up row $E_{x_t}$.
- $L$ Transformer blocks (causal masked attention + FFN).
- Unembedding — project to logits and apply softmax:
$$p(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}(\mathbf{h}_t W_U), \qquad W_U \in \mathbb{R}^{d \times |V|}$$
Causal (masked) self-attention — token $t$ can only attend to positions $\le t$:
$$M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}, \qquad \mathrm{Attention} = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V$$
Tokenisation & Embedding
Byte-Pair Encoding (BPE) — iteratively merge the most frequent adjacent symbol pair until the target vocabulary size is reached.
Token embedding + positional encoding → input representation:
$$\mathbf{h}_t^{(0)} = E_{x_t} + PE_t$$
RoPE (Rotary Positional Embedding) — encode position by rotating query/key vectors:
$$q_m' = R_{\Theta, m}\, q_m, \qquad k_n' = R_{\Theta, n}\, k_n$$
where $R_{\Theta, m}$ is a block-diagonal rotation matrix at angles $m\theta_i$.
Perplexity (PPL)
The standard intrinsic evaluation metric for language models — how “surprised” the model is by the test set:
$$\mathrm{PPL} = \exp\!\left(-\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\right)$$
- Lower PPL = model assigns higher probability to real text = better.
- PPL is simply the exponential of the cross-entropy loss: $\mathrm{PPL} = e^{\mathcal{L}_{\mathrm{CE}}}$.
- $\mathrm{PPL} = k$ means the model is as uncertain as choosing uniformly among $k$ tokens.
- GPT-2 (117M): PPL ≈ 35 on WikiText-103; GPT-4-class models: PPL in the single digits.
Bits-per-character (BPC) — alternative unit used for character-level models:
$$\mathrm{BPC} = -\frac{1}{N} \sum_{t=1}^{N} \log_2 p(c_t \mid c_{<t})$$
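The PPL formula in a few lines of Python, using made-up per-token probabilities to check the "uniform over $k$ tokens" interpretation:

```python
import math

def perplexity(token_probs):
    """PPL = exp(mean negative log-likelihood) over a token sequence."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Uniform uncertainty over a 50-token vocabulary -> PPL exactly 50
ppl_uniform = perplexity([1 / 50] * 10)
# A model assigning probability 0.5 to every real token -> PPL 2
ppl_better = perplexity([0.5] * 10)
```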
Scaling Laws
Model performance scales predictably with compute $C$, data $D$, and parameters $N$ (Chinchilla):
$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$
Optimal allocation for a compute budget $C \approx 6ND$:
$$N^* \propto C^{0.5}, \qquad D^* \propto C^{0.5}$$
i.e. tokens and parameters should scale equally.
In-Context Learning (ICL) & Prompting
LLMs can learn from examples given in the prompt, without updating any weights.
$k$-shot prompting — prepend $k$ (input, output) examples:
$$\text{prompt} = (x_1, y_1), \ldots, (x_k, y_k),\; x_{\mathrm{test}}$$
Chain-of-Thought (CoT) — include reasoning steps $z$ before the answer $y$:
$$p(y \mid x) = \sum_z p(y \mid z, x)\, p(z \mid x)$$
Self-consistency — sample $m$ reasoning paths, take the majority vote:
$$\hat{y} = \mathrm{mode}\!\left(y^{(1)}, \ldots, y^{(m)}\right)$$
Retrieval-Augmented Generation (RAG)
Augment generation with retrieved documents $d_1, \ldots, d_k$ from an external knowledge base:
$$p(y \mid x) = \prod_t p_\theta\!\left(y_t \mid y_{<t},\, x,\, d_1, \ldots, d_k\right)$$
Retrieval — encode the query and documents, fetch the top-$k$ by cosine similarity:
$$\mathrm{score}(q, d) = \frac{E_q(q) \cdot E_d(d)}{\|E_q(q)\|\, \|E_d(d)\|}$$
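The retrieval step sketched in pure Python, with toy 3-d "embeddings" standing in for real encoder outputs:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the top-k documents by cosine similarity."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: -cosine(query_vec, doc_vecs[i]))
    return ranked[:k]

# Toy document embeddings (assumed to come from the same encoder as the query)
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
top = retrieve([1.0, 0.05, 0.0], docs, k=2)
```

The query is nearly parallel to documents 0 and 2, so those are retrieved; document 1 is orthogonal and ranks last.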
KV Cache
During autoregressive inference, the keys and values for all past tokens are cached to avoid recomputing them for each new token:
$$K_{1:t} = [K_{1:t-1};\, k_t], \qquad V_{1:t} = [V_{1:t-1};\, v_t]$$
The new token only computes $q_t, k_t, v_t$, then attends over the cached $K_{1:t}, V_{1:t}$. This reduces per-step attention cost from $O(t^2)$ to $O(t)$; cache memory grows linearly with sequence length.
Mixture of Experts (MoE)
Replace the dense FFN in each Transformer block with $E$ expert FFNs. A learned router selects the top-$k$ experts per token:
$$\mathbf{y} = \sum_{i \in \mathrm{TopK}(g(\mathbf{x}))} g_i(\mathbf{x})\, \mathrm{FFN}_i(\mathbf{x}), \qquad g(\mathbf{x}) = \mathrm{softmax}(W_r \mathbf{x})$$
- Activated params per token: roughly $k/E$ of the total → same inference cost as a smaller dense model.
- Load balancing loss encourages uniform expert utilisation: $\mathcal{L}_{\mathrm{aux}} = E \sum_{i=1}^{E} f_i P_i$, where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the mean router probability for expert $i$.
Used in Mixtral, Switch Transformer, and (reportedly) GPT-4.
Key LLM Concepts Summary
| Concept | What it does | Key formula / idea |
|---|---|---|
| Tokenisation (BPE) | Text → integer IDs | Merge most-frequent pairs |
| Causal Attention | Each token sees only the past | Mask $M_{ij} = -\infty$ for $j > i$ |
| Scaling Law | Predict loss from $N, D$ | $L = E + A/N^\alpha + B/D^\beta$ |
| SFT | Align model to instructions | Cross-entropy on demonstrations |
| RLHF / DPO | Align to human preferences | Reward signal or preference pairs |
| CoT Prompting | Elicit step-by-step reasoning | $p(y \mid x) = \sum_z p(y \mid z, x)\, p(z \mid x)$ |
| RAG | Ground generation in facts | Retrieve then generate |
| LoRA | Parameter-efficient fine-tuning | $W' = W + BA$, $r \ll \min(d, k)$ |
Multimodal AI
Multimodal models process and generate more than one modality (text, image, audio, video) within a unified framework.
Modality Encoding
Each modality is first encoded into a shared embedding space $\mathbb{R}^d$:
| Modality | Encoder | Output |
|---|---|---|
| Text | Tokeniser + Embedding | $Z \in \mathbb{R}^{T \times d}$ |
| Image | ViT / CNN patch encoder | $Z \in \mathbb{R}^{n \times d}$ |
| Audio | Spectrogram + Conv / Whisper | $Z \in \mathbb{R}^{T' \times d}$ |
| Video | Frame-level ViT + temporal attention | $Z \in \mathbb{R}^{F \times n \times d}$ |
CLIP — Contrastive Vision-Language Pre-training
Learn aligned image and text embeddings by maximising agreement between matched pairs.
Given a batch of $N$ (image $I_i$, text $T_i$) pairs, compute cosine similarities scaled by temperature $\tau$:
$$s_{ij} = \frac{f(I_i) \cdot g(T_j)}{\|f(I_i)\|\, \|g(T_j)\|} \cdot \frac{1}{\tau}$$
Symmetric InfoNCE loss (maximise the diagonal, minimise the off-diagonal):
$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right), \qquad \mathcal{L}_{I \to T} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s_{ii}}}{\sum_{j=1}^{N} e^{s_{ij}}}$$
At test time: zero-shot classification by picking the text prompt with highest cosine similarity to the image.
Vision-Language Models (VLM)
Connect a vision encoder to an LLM via a projection layer.
Architecture:
$$\text{image} \xrightarrow{\text{vision encoder}} \mathbf{z}_v \xrightarrow{\;W_p\;} \text{visual tokens} \rightarrow \text{LLM}$$
Visual tokens are prepended (or interleaved) with text tokens and fed into the LLM.
LLaVA-style training — two stages:
- Pre-train the projection only (freeze encoder + LLM): learn $W_p$
- Instruction fine-tuning: unfreeze the LLM, train on (image, instruction, response) triplets
Objective — standard next-token prediction on response tokens only:
$$\mathcal{L} = -\sum_t \log p_\theta\!\left(y_t \mid y_{<t},\, \mathbf{v},\, x\right)$$
Cross-Modal Attention
Allow one modality to attend over another. Text queries attend to visual keys/values:
$$\mathrm{Attention}(Q_{\mathrm{text}},\, K_{\mathrm{img}},\, V_{\mathrm{img}}) = \mathrm{softmax}\!\left(\frac{Q_{\mathrm{text}} K_{\mathrm{img}}^\top}{\sqrt{d_k}}\right) V_{\mathrm{img}}$$
Used in Flamingo, Perceiver Resampler, etc.
Image Generation — Text-to-Image
Condition a diffusion model on a text embedding $c = \tau_\theta(\text{prompt})$.
Classifier-Free Guidance (CFG) — blend the conditional and unconditional score:
$$\tilde{\epsilon} = \epsilon_\theta(x_t, \varnothing) + w\left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right)$$
- $w > 1$: stronger text conditioning (higher fidelity to the prompt, less diversity).
- $w = 1$: standard conditional generation.
- $w = 0$: unconditional generation.
Latent Diffusion (Stable Diffusion) — run diffusion in a compressed latent space:
$$z = \mathcal{E}(x), \qquad \text{diffuse in latent space}, \qquad \hat{x} = \mathcal{D}(z_0)$$
Multimodal Summary
| Model / Concept | Modalities | Key idea |
|---|---|---|
| CLIP | Image + Text | Contrastive alignment, InfoNCE |
| LLaVA / InternVL | Image + Text | Visual tokens → LLM via projection |
| Flamingo | Image + Text | Cross-modal attention layers |
| Stable Diffusion | Text → Image | Latent diffusion + CFG |
| Whisper | Audio → Text | Spectrogram encoder + decoder |
| GPT-4o | Image/Audio/Text | Unified multimodal autoregressive LLM |