Related notes: COMPSCI 714, Transformers, Mathematical Foundations, Build GPT-2 from Scratch, AI Systems, Soft Computing Explained

Roadmap

---
config:
  mindmap:
    padding: 18
    maxNodeWidth: 220
---
mindmap
  root((**Deep Learning**))
    **Probability and Statistics**
      Gaussian Distribution
      Expectation and Variance
      Entropy
      KL Divergence
      Bayes Theorem
    **Fundamentals**
      Linear Regression
      Loss Functions
        SE MSE
        Cross-Entropy
      Derivatives
        Gradient
        Chain Rule
        Gradient Descent
    **Linear Algebra**
      Vectors
        Dot Product
        Norm
        Cosine Similarity
      Matrices
        Matrix Multiply
        Transpose
        Broadcast
    **Neural Networks**
      Single Neuron
      MLP
        Forward Pass
        Backpropagation
      Activations
        ReLU Sigmoid Tanh
        Softmax GELU
      Normalisation
        Batch Norm
        Layer Norm
      Pathologies
        Vanishing Gradient
        Exploding Gradient
        Zigzag
      ResNet Skip Connection
      Training
        Dropout
        Adam
        L2 Regularisation
    **Architectures**
      CNN
        2D Convolution
        Pooling
        Feature Maps
      RNN
        Vanilla RNN
        LSTM
        GRU
      Attention
        Scaled Dot-Product
        Multi-Head
        Positional Encoding
        Transformer Block
      Encoder-Decoder
        BERT MLM
        GPT vs BERT
      ViT
    **Transfer Learning**
      Pre-train to Fine-tune
      LoRA
    **Generative AI**
      VAE
        ELBO
        Reparameterisation
      GAN
        Generator
        Discriminator
      Diffusion DDPM
        Forward Noise
        Reverse Denoise
      Autoregressive
        Temperature
        Top-p Sampling
      RLHF
        Reward Model
        PPO
        DPO
    **LLM**
      Architecture
        Tokenisation BPE
        Causal Attention
        Unembedding
      KV Cache
      MoE
      Perplexity PPL
      Scaling Laws
      Prompting
        k-shot
        Chain-of-Thought
        Self-Consistency
      RAG
    **Multimodal AI**
      CLIP
        InfoNCE Loss
      VLM
        Visual Encoder
        Projection
      Cross-Modal Attention
      Text-to-Image
        CFG
        Latent Diffusion

Probability & Statistics Foundations

Gaussian Distribution

The most important distribution in deep learning. A random variable $x \sim \mathcal{N}(\mu, \sigma^2)$ has density:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

  • $\mu$: mean (centre of the bell curve)
  • $\sigma^2$: variance (spread); $\sigma$ is the standard deviation

Multivariate Gaussian $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ in $d$ dimensions, covariance matrix $\Sigma$:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$
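As a quick numerical check of the univariate density — a minimal sketch using only the standard library (function name is illustrative):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Peak of the standard normal is 1/sqrt(2*pi) ≈ 0.3989
print(round(gaussian_pdf(0.0), 4))  # → 0.3989
# The bell curve is symmetric about the mean
print(gaussian_pdf(1.0) == gaussian_pdf(-1.0))  # → True
```

Larger $\sigma$ spreads the same total probability wider, so the peak drops.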

Expectation & Variance

$$\mathbb{E}[x] = \int x\,p(x)\,dx, \qquad \mathrm{Var}(x) = \mathbb{E}\left[(x - \mathbb{E}[x])^2\right] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$$

Entropy & KL Divergence

Entropy — measures uncertainty of distribution $p$:

$$H(p) = -\sum_x p(x)\log p(x)$$

KL Divergence — measures how much $p$ differs from $q$ (not a distance — asymmetric):

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x)\log\frac{p(x)}{q(x)} \geq 0$$

Cross-entropy loss is directly related: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, where $p$ is the true label distribution.
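Both quantities are easy to verify numerically; a dependency-free sketch in nats (the example distributions are illustrative):

```python
import math

def entropy(p):
    """H(p) = -sum p log p, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q): >= 0, zero iff p == q, and asymmetric."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25] * 4
peaked = [0.7, 0.1, 0.1, 0.1]

print(round(entropy(uniform), 4))  # → 1.3863, i.e. log 4: maximum uncertainty
print(round(kl_divergence(peaked, uniform), 4))
print(round(kl_divergence(uniform, peaked), 4))  # differs: KL is not symmetric
```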

Bayes’ Theorem

$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}$$

  • $p(\theta)$: prior — belief about parameters before seeing data
  • $p(D \mid \theta)$: likelihood — how well $\theta$ explains data
  • $p(\theta \mid D)$: posterior — updated belief after data
  • $p(D)$: evidence — normalisation constant

MLE (Maximum Likelihood Estimation) maximises $p(D \mid \theta)$; MAP maximises $p(\theta \mid D)$, which adds the prior as a regulariser.


Some Essential ML Concepts, Mathematically

Linear Regression (LR)

$$\hat{y} = \mathbf{w}^\top\mathbf{x} + b$$

In matrix form over $N$ samples:

$$\hat{\mathbf{y}} = X\mathbf{w} + b, \qquad X \in \mathbb{R}^{N \times d}$$

Squared Error (SE)

$$\mathrm{SE} = (y - \hat{y})^2$$

Mean Squared Error (MSE)

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$

Cross-Entropy Loss (CE)

Used for classification. For a true one-hot label $\mathbf{y}$ and predicted probability vector $\hat{\mathbf{y}}$:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_c y_c \log\hat{y}_c$$

For binary classification ($y \in \{0, 1\}$):

$$\mathcal{L} = -\left[y\log\hat{y} + (1 - y)\log(1 - \hat{y})\right]$$

Over $N$ samples (multiclass):

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_c y_{i,c}\log\hat{y}_{i,c}$$
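A minimal sketch of the one-sample cross-entropy (the one-hot label and probability vectors below are illustrative):

```python
import math

def cross_entropy(y_true, y_pred):
    """CE = -sum_c y_c log(ŷ_c) for one sample, with y_true one-hot."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# True class is index 0: a confident correct prediction gives low loss,
# putting low probability on the true class gives high loss
confident = cross_entropy([1, 0, 0], [0.9, 0.05, 0.05])
wrong = cross_entropy([1, 0, 0], [0.1, 0.45, 0.45])
print(confident < wrong)  # → True
```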

Derivatives

Derivative — instantaneous rate of change:

$$f'(x) = \lim_{h \to 0}\frac{f(x + h) - f(x)}{h}$$

Gradient — multivariate generalisation for $f: \mathbb{R}^n \to \mathbb{R}$:

$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)$$

Chain Rule — for $y = f(g(x))$:

$$\frac{dy}{dx} = \frac{df}{dg}\cdot\frac{dg}{dx}$$

For deep compositions $y = f_L(f_{L-1}(\cdots f_1(x)))$:

$$\frac{dy}{dx} = \prod_{l=1}^{L}\frac{\partial f_l}{\partial f_{l-1}}$$

Gradient Descent — minimise $\mathcal{L}(\theta)$ by stepping opposite the gradient:

$$\theta \leftarrow \theta - \eta\,\nabla_\theta\mathcal{L}$$
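The update rule can be demonstrated on a toy quadratic loss (the loss function, learning rate, and step count below are illustrative choices):

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Repeat theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimise L(θ) = (θ - 3)², whose gradient is 2(θ - 3); the minimum is θ = 3
theta = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
print(round(theta, 4))  # → 3.0
```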

Four Step Process For Machine Learning

  1. Collect the data
  2. Define the model’s structure
  3. Define the loss function
  4. Minimise the loss

Vector is all you need

A vector $\mathbf{x} \in \mathbb{R}^d$ encodes a data point with $d$ features.

Operations:

  • Add: $\mathbf{a} + \mathbf{b} = (a_1 + b_1, \dots, a_d + b_d)$
  • Dot product: $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$
  • Norm: $\|\mathbf{a}\| = \sqrt{\sum_i a_i^2}$
  • Cosine similarity: $\cos\theta = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$
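Each operation is a one-liner; a dependency-free sketch:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

a, b = [1.0, 0.0], [0.0, 1.0]
print(dot(a, b))                         # → 0.0: orthogonal vectors
print(cosine_similarity(a, [2.0, 0.0]))  # → 1.0: same direction, length-invariant
```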

Matrix

A matrix $W \in \mathbb{R}^{m \times n}$ represents a linear transformation $\mathbb{R}^n \to \mathbb{R}^m$.

Operations:

  • Add: $(A + B)_{ij} = A_{ij} + B_{ij}$
  • Mul (matrix product): $(AB)_{ij} = \sum_k A_{ik}B_{kj}$
  • Broadcast: scalar/vector ops extend across batch dimensions
  • Dot product / inner product: $\langle\mathbf{a}, \mathbf{b}\rangle = \mathbf{a}^\top\mathbf{b}$
  • Transpose: $(A^\top)_{ij} = A_{ji}$

Neural Network

Single neuron:

$$y = \sigma(\mathbf{w}^\top\mathbf{x} + b)$$

MLP — layer $l$ forward pass:

$$\mathbf{h}^{(l)} = \sigma\!\left(W^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)$$

Backpropagation (chain rule applied layer by layer):

$$\frac{\partial\mathcal{L}}{\partial W^{(l)}} = \frac{\partial\mathcal{L}}{\partial\mathbf{h}^{(L)}}\cdot\frac{\partial\mathbf{h}^{(L)}}{\partial\mathbf{h}^{(L-1)}}\cdots\frac{\partial\mathbf{h}^{(l)}}{\partial W^{(l)}}$$

Activation:

ReLU: $\mathrm{ReLU}(x) = \max(0, x)$

Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$

Tanh: $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$

Softmax: $\mathrm{softmax}(\mathbf{z})_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$

GELU (Gaussian Error Linear Unit) — used in BERT, GPT, and all modern Transformers:

$$\mathrm{GELU}(x) = x\,\Phi(x)$$

where $\Phi$ is the standard Gaussian CDF. Practical approximation:

$$\mathrm{GELU}(x) \approx 0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,(x + 0.044715x^3)\right]\right)$$

Unlike ReLU, GELU is smooth and non-zero for $x < 0$ (negative inputs pass through with small weight), which helps gradient flow.
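The contrast with ReLU is visible numerically; a sketch using the tanh approximation above:

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """GELU tanh approximation: 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

print(relu(-0.5))            # → 0.0: ReLU kills every negative input
print(round(gelu(-0.5), 4))  # small negative value: some signal still flows
print(round(gelu(2.0), 4))   # close to x for large positive inputs, like ReLU
```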

Training detail

  • Parameters $\theta = \{W, \mathbf{b}\}$: learned by gradient descent.
  • Hyperparameters: learning rate $\eta$, batch size $B$, layers $L$, hidden dim $d$ — set before training.

$L_2$ Regularisation (weight decay):

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L} + \lambda\|\theta\|_2^2$$

Dropout — randomly zero out neurons during training with probability $p$:

$$\tilde{\mathbf{h}} = \mathbf{m} \odot \mathbf{h}, \qquad m_i \sim \mathrm{Bernoulli}(1 - p)$$

At inference, no dropout is applied (weights already scaled by $1/(1-p)$ at train time — inverted dropout).

Batch Normalisation (BN) — normalise activations across the batch dimension, then scale and shift:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma\hat{x} + \beta$$

where $\mu_B, \sigma_B^2$ are batch mean/variance; $\gamma, \beta$ are learned parameters.

Layer Normalisation (LN) — same formula but normalise across the feature dimension (not batch). Preferred in Transformers because it is batch-size independent.

Adam optimiser — bias-corrected moment estimates:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
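A scalar sketch of the Adam update (the defaults follow the common $\beta_1 = 0.9$, $\beta_2 = 0.999$ convention; the quadratic loss, learning rate, and step count are illustrative):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = b1 * m + (1 - b1) * grad        # first-moment EMA
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment EMA
    m_hat = m / (1 - b1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimise L(θ) = (θ - 3)², gradient 2(θ - 3), starting from θ = 0
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * (theta - 3), m, v, t)
print(abs(theta - 3.0) < 0.1)  # converged near the minimum
```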

Vanishing & Exploding Gradients

In a deep network with $L$ layers, the gradient flowing back to layer $l$ is a product of Jacobians:

$$\frac{\partial\mathcal{L}}{\partial\mathbf{h}^{(l)}} = \frac{\partial\mathcal{L}}{\partial\mathbf{h}^{(L)}}\prod_{k=l+1}^{L}\frac{\partial\mathbf{h}^{(k)}}{\partial\mathbf{h}^{(k-1)}}$$

If the spectral norm of each factor is $< 1$ repeatedly, gradients vanish exponentially; if $> 1$, they explode.

  • Vanishing: gradients $\to 0$, early layers learn nothing. Caused by sigmoid/tanh saturation ($\sigma'(x) \leq 0.25$) + many layers.
  • Exploding: parameter updates become huge, training diverges.

Fixes: ReLU (gradient = 1 in positive region), residual connections, LayerNorm, gradient clipping:

$$\mathbf{g} \leftarrow \mathbf{g}\cdot\min\!\left(1, \frac{c}{\|\mathbf{g}\|}\right)$$

Zigzag in Gradient Descent

Vanilla SGD with a fixed learning rate oscillates (zigzags) when the loss surface has different curvatures along different directions — it overshoots in high-curvature directions and undershoots in low-curvature ones. If $\eta$ is large enough to make progress along the flat direction, it overshoots along the steep direction → zigzag trajectory.

Momentum damps oscillations by accumulating a velocity vector:

$$\mathbf{v}_t = \beta\mathbf{v}_{t-1} + \nabla_\theta\mathcal{L}, \qquad \theta \leftarrow \theta - \eta\mathbf{v}_t$$

The exponential moving average of gradients cancels out oscillating components while reinforcing consistent directions. Adam further applies per-parameter adaptive learning rates via $\hat{v}_t$ (second moment), which is why it usually converges faster than SGD on non-convex landscapes.

ResNet — Residual Connections

Add a skip connection that bypasses one or more layers, letting gradients flow directly to earlier layers:

$$\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$$

  • $F(\mathbf{x})$: the residual to learn (e.g. two conv layers)
  • $\mathbf{x}$: identity shortcut

Why it works: $\partial\mathbf{y}/\partial\mathbf{x} = \partial F/\partial\mathbf{x} + I$, so the gradient through the skip path is exactly the identity — no matter how deep, the chain rule always has a direct path with gradient $1$. This prevents vanishing gradients in very deep networks (ResNet-152, Transformers).

CNN

2D convolution — kernel $K$ slides over image $I$:

$$(I * K)[i, j] = \sum_m\sum_n I[i + m,\, j + n]\,K[m, n]$$

Each layer applies $C_{\mathrm{out}}$ kernels → feature map of shape $C_{\mathrm{out}} \times H' \times W'$.

Pooling — downsample spatial dimensions to reduce computation and add translation invariance:

$$\mathrm{MaxPool}(X)[i, j] = \max_{0 \leq m, n < k} X[si + m,\, sj + n]$$

where $k$ is the pool size and $s$ is the stride.

RNN

Vanilla RNN — hidden state recurrence:

$$\mathbf{h}_t = \tanh(W_h\mathbf{h}_{t-1} + W_x\mathbf{x}_t + \mathbf{b})$$

LSTM — gated memory cell $\mathbf{c}_t$ with input/forget/output gates (biases omitted):

$$\mathbf{f}_t = \sigma(W_f[\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad \mathbf{i}_t = \sigma(W_i[\mathbf{h}_{t-1}, \mathbf{x}_t]), \quad \mathbf{o}_t = \sigma(W_o[\mathbf{h}_{t-1}, \mathbf{x}_t])$$

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(W_c[\mathbf{h}_{t-1}, \mathbf{x}_t]), \qquad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

GRU — simplified gating (reset $\mathbf{r}_t$, update $\mathbf{z}_t$):

$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tanh(W[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])$$

Attention

Scaled Dot-Product Attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
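A minimal numpy sketch of scaled dot-product attention (single head, no batching; the shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q Kᵀ / √d_k) V for one head."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape)                             # → (4, 8): one output per query
print(bool(np.allclose(w.sum(axis=1), 1)))   # → True: valid attention weights
```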

Multi-Head Attention — $h$ parallel heads, then project:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Positional Encoding (sinusoidal):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Transformer block (with residual + LayerNorm):

$$\mathbf{h} = \mathrm{LN}(\mathbf{x} + \mathrm{MultiHead}(\mathbf{x})), \qquad \mathbf{y} = \mathrm{LN}(\mathbf{h} + \mathrm{FFN}(\mathbf{h}))$$

Language Modelling Objective — maximise log-likelihood over token sequence:

$$\mathcal{L} = \sum_{t=1}^{T}\log p_\theta(x_t \mid x_{<t})$$

ViT — split image into $N$ patches $\mathbf{p}_i$, linearly project to token embeddings, then apply Transformer encoder:

$$\mathbf{z}_0 = [\mathbf{x}_{\mathrm{cls}};\, \mathbf{p}_1 E;\, \dots;\, \mathbf{p}_N E] + E_{\mathrm{pos}}$$

Encoder-Decoder Transformer (original seq2seq, e.g. T5, BART) — encoder processes source sequence bidirectionally; decoder generates target tokens autoregressively with cross-attention over encoder outputs:

Decoder block = (causal self-attention) → (cross-attention to encoder) → (FFN), each with residual + LayerNorm.

BERT — Bidirectional Encoder

BERT uses the encoder-only Transformer and pre-trains on two objectives:

1. Masked Language Model (MLM) — randomly mask 15% of tokens, predict them from full bidirectional context:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M}\log p(x_i \mid \mathbf{x}_{\setminus M})$$

  • $M$: set of masked positions
  • $\mathbf{x}_{\setminus M}$: all tokens except the masked ones

Because attention is bidirectional (no causal mask), every token can attend to every other token — unlike GPT, which only sees past context.

2. Next Sentence Prediction (NSP) — binary classification: does sentence B follow sentence A?

GPT vs BERT comparison:

| | GPT (decoder-only) | BERT (encoder-only) |
| --- | --- | --- |
| Attention | Causal (left-to-right) | Bidirectional (full) |
| Pre-training | Next-token prediction | MLM + NSP |
| Strength | Generation | Understanding / Classification |
| Representation | $x_t$ sees only $x_{<t}$ | $x_t$ sees all tokens |

Transfer Learning

Pre-train on a large corpus to get $\theta_{\mathrm{pre}}$, then fine-tune on target task data $D_{\mathrm{task}}$:

$$\theta^* = \arg\min_\theta\,\mathcal{L}_{\mathrm{task}}(\theta; D_{\mathrm{task}}), \qquad \theta \text{ initialised from } \theta_{\mathrm{pre}}$$

LoRA — freeze $W$, inject low-rank update ($r \ll d$):

$$W' = W + \frac{\alpha}{r}BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d}$$
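A numpy sketch of the LoRA forward pass (the dimensions, the $\alpha/r$ scaling, and the zero initialisation of $B$ follow common practice; all concrete values are illustrative):

```python
import numpy as np

d, r = 64, 4                        # hidden dim and LoRA rank, r << d
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero at init

def lora_forward(x, alpha=8.0):
    """h = W x + (α/r)·B(A x); only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
print(bool(np.allclose(lora_forward(x), W @ x)))  # True: B = 0 ⇒ W' = W at init
print(2 * d * r, "trainable params vs", d * d)    # 512 vs 4096 for full FT
```

Zero-initialising $B$ means fine-tuning starts exactly at the pre-trained model and only gradually departs from it.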


Generative AI

Generative models learn the data distribution $p(x)$ and can sample new data from it.

The core distinction from discriminative models:

| | Discriminative | Generative |
| --- | --- | --- |
| Goal | $p(y \mid x)$ | $p(x)$ or $p(x, y)$ |
| Output | Label / decision | New data sample |
| Examples | Classifier, Regression | VAE, GAN, Diffusion, LLM |

Variational Autoencoder (VAE)

Encode data $\mathbf{x}$ into a latent variable $\mathbf{z}$, then decode back.

Encoder — approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$, parameterised as Gaussian:

$$q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\!\left(\boldsymbol{\mu}_\phi(\mathbf{x}),\, \mathrm{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x}))\right)$$

Decoder — likelihood $p_\theta(\mathbf{x} \mid \mathbf{z})$.

ELBO objective (Evidence Lower Bound, maximise):

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$$

KL divergence between the Gaussian posterior and the standard normal prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)$ (closed form):

$$D_{\mathrm{KL}} = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

Reparameterisation trick — make sampling differentiable:

$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$$
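A sketch verifying that samples drawn via the trick have the intended moments (the μ and log-variance values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    """z = μ + σ ⊙ ε with ε ~ N(0, I): sampling becomes a deterministic
    function of (μ, σ) plus external noise, so gradients flow through μ, σ."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu, log_var = np.array([1.0, -1.0]), np.array([0.0, 0.0])  # log σ² = 0 ⇒ σ = 1
z = np.stack([sample_latent(mu, log_var) for _ in range(100_000)])
print(bool(np.allclose(z.mean(axis=0), mu, atol=0.02)))  # empirical mean ≈ μ
print(bool(np.allclose(z.std(axis=0), 1.0, atol=0.02)))  # empirical std ≈ σ
```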


Generative Adversarial Network (GAN)

Two networks compete: Generator $G$ tries to fool Discriminator $D$:

$$\min_G\max_D\; \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))]$$

  • $G(\mathbf{z})$: maps noise $\mathbf{z}$ to fake samples.
  • $D(\mathbf{x})$: probability that $\mathbf{x}$ is real.
  • At equilibrium: $D(\mathbf{x}) = \tfrac{1}{2}$ everywhere — generator perfectly mimics data.

Diffusion Models (DDPM)

Gradually add Gaussian noise over $T$ steps (forward process), then learn to reverse it (reverse process).

Forward process — fixed Markov chain:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\, \beta_t I\right)$$

Closed-form sampling at any step $t$ (let $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$):

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

Reverse process — learned denoiser $p_\theta$:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \Sigma_\theta(\mathbf{x}_t, t)\right)$$

Training objective — predict the noise:

$$\mathcal{L} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right\|^2\right]$$


Autoregressive Generation (LLM Decoding)

Given a prompt $x_{1:n}$, generate token-by-token:

$$x_{n+1} \sim p_\theta(\cdot \mid x_{1:n}), \quad x_{n+2} \sim p_\theta(\cdot \mid x_{1:n+1}), \;\dots$$

Temperature scaling — control sharpness of distribution:

$$p_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

  • $T \to 0$: greedy (deterministic); $T = 1$: standard softmax; $T > 1$: more random.

Top-$p$ (nucleus) sampling — sample from the smallest set $V_p$ s.t.:

$$\sum_{x \in V_p} p(x) \geq p$$
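Both decoding controls can be sketched in a few lines (the logits below are illustrative values):

```python
import math

def apply_temperature(logits, T):
    """softmax(z / T): T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [z / T for z in logits]
    m = max(scaled)                         # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_set(probs, p=0.9):
    """Smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        cum += prob
        if cum >= p:
            break
    return kept

logits = [2.0, 1.0, 0.5, 0.1]
sharp = apply_temperature(logits, T=0.5)
flat = apply_temperature(logits, T=2.0)
print(sharp[0] > flat[0])       # → True: low temperature concentrates mass
print(top_p_set(sharp, p=0.9))  # → [0, 1]: the low-probability tail is cut off
```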


Reinforcement Learning from Human Feedback (RLHF)

Step 1 — Supervised Fine-Tuning (SFT): fine-tune LLM on human demonstrations.

Step 2 — Reward Model: train $r_\phi$ from preference pairs $(y_w, y_l)$ (chosen, rejected):

$$\mathcal{L}_{\mathrm{RM}} = -\mathbb{E}\left[\log\sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

Step 3 — PPO fine-tuning — maximise reward while staying close to the SFT policy $\pi_{\mathrm{SFT}}$:

$$\max_\pi\; \mathbb{E}_{y \sim \pi}\!\left[r_\phi(x, y)\right] - \beta\,D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x)\right)$$

DPO (Direct Preference Optimisation) — skips the reward model, optimises preferences directly:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$


Large Language Models (LLM)

Architecture Overview

A decoder-only Transformer stacks $L$ blocks. Given token sequence $x_{1:T}$:

  1. Tokenisation — map text to integer IDs via vocabulary $V$, $x_i \in \{1, \dots, |V|\}$.
  2. Embedding — look up row $x_i$ of $E \in \mathbb{R}^{|V| \times d}$: $\mathbf{e}_i = E[x_i]$.
  3. $L$ Transformer blocks (causal masked attention + FFN).
  4. Unembedding — project to logits and apply softmax: $p(x_{t+1} \mid x_{\leq t}) = \mathrm{softmax}(W_U\mathbf{h}_t)$.

Causal (masked) self-attention — token $t$ can only attend to positions $\leq t$:

$$\mathrm{Attention} = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases}0 & j \leq i \\ -\infty & j > i\end{cases}$$


Tokenisation & Embedding

Byte-Pair Encoding (BPE) — iteratively merge the most frequent adjacent pair until vocabulary size is reached.

Token embedding + positional encoding → input representation:

$$\mathbf{h}_i^{(0)} = E[x_i] + PE_i$$

RoPE (Rotary Positional Embedding) — encode position $m$ by rotating query/key vectors:

$$\tilde{\mathbf{q}}_m = R_{\Theta,m}\,\mathbf{q}_m, \qquad \tilde{\mathbf{k}}_n = R_{\Theta,n}\,\mathbf{k}_n$$

where $R_{\Theta,m}$ is a block-diagonal rotation matrix whose $i$-th block rotates by angle $m\theta_i$.


Perplexity (PPL)

The standard intrinsic evaluation metric for language models — how “surprised” the model is by the test set:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(x_t \mid x_{<t})\right)$$

  • Lower PPL = model assigns higher probability to real text = better.
  • PPL is simply the exponential of the cross-entropy loss: $\mathrm{PPL} = e^{\mathcal{L}_{\mathrm{CE}}}$.
  • $\mathrm{PPL} = k$ means the model is as uncertain as choosing uniformly among $k$ tokens.
  • GPT-2 (117M): PPL ≈ 35 on WikiText-103; GPT-4 class models: PPL in the single digits.
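A sketch computing PPL from per-token probabilities (the probability lists are illustrative):

```python
import math

def perplexity(token_probs):
    """PPL = exp(mean negative log-likelihood) over the test tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that spreads mass uniformly over k = 4 tokens has PPL = 4
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 4))  # → 4.0
# Higher probability on the observed tokens → lower perplexity
print(round(perplexity([0.9, 0.8, 0.95, 0.7]), 4))
```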

Bits-per-character (BPC) — alternative unit used for character-level models:

$$\mathrm{BPC} = \frac{\mathcal{L}_{\mathrm{CE}}}{\ln 2} \quad \text{(per-character cross-entropy converted from nats to bits)}$$


Scaling Laws

Model performance scales predictably with compute $C$, data $D$, and parameters $N$ (Chinchilla):

$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

Optimal allocation for a compute budget $C \approx 6ND$:

$$N_{\mathrm{opt}} \propto C^{0.5}, \qquad D_{\mathrm{opt}} \propto C^{0.5}$$

i.e. tokens and parameters should scale equally.


In-Context Learning (ICL) & Prompting

LLMs can learn from examples in the prompt without updating weights.

$k$-shot prompting — prepend $k$ (input, output) examples:

$$p_\theta(y \mid (x_1, y_1), \dots, (x_k, y_k), x)$$

Chain-of-Thought (CoT) — include reasoning steps $r$ before the answer $y$:

$$p_\theta(r, y \mid x) = p_\theta(r \mid x)\,p_\theta(y \mid x, r)$$

Self-consistency — sample $m$ reasoning paths, take majority vote:

$$\hat{y} = \mathrm{mode}\{y^{(1)}, \dots, y^{(m)}\}$$


Retrieval-Augmented Generation (RAG)

Augment generation with retrieved documents $d_1, \dots, d_k$ from an external knowledge base:

$$p(y \mid x) = p_\theta(y \mid x, d_1, \dots, d_k)$$

Retrieval — encode query and documents, fetch top-$k$ by cosine similarity:

$$\mathrm{score}(x, d) = \frac{E_q(x) \cdot E_d(d)}{\|E_q(x)\|\,\|E_d(d)\|}$$


KV Cache

During autoregressive inference, the keys $K_{1:t}$ and values $V_{1:t}$ for all past tokens are cached — avoiding recomputation on each new token.

A new token only computes its own $(\mathbf{q}, \mathbf{k}, \mathbf{v})$, appends $(\mathbf{k}, \mathbf{v})$ to the cache, then attends over the cached $K, V$. This reduces per-step attention cost from $O(t^2)$ to $O(t)$; memory grows linearly with sequence length.
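A single-head numpy sketch of the mechanism (the class name `KVCache` and the shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Append-only per-head cache of past keys and values."""
    def __init__(self, d_k):
        self.K = np.empty((0, d_k))
        self.V = np.empty((0, d_k))

    def step(self, q, k, v):
        """Cache the new token's (k, v), then attend q over all cached pairs."""
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        w = softmax(q @ self.K.T / np.sqrt(self.K.shape[1]))  # O(t) per step
        return w @ self.V

rng = np.random.default_rng(0)
cache = KVCache(d_k=8)
for _ in range(5):
    q, k, v = rng.normal(size=(3, 8))
    out = cache.step(q, k, v)
print(cache.K.shape)  # → (5, 8): keys for all 5 generated tokens stay cached
```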


Mixture of Experts (MoE)

Replace the dense FFN in each Transformer block with $E$ expert FFNs. A learned router selects the top-$k$ experts per token:

$$\mathbf{y} = \sum_{i \in \mathrm{TopK}(\mathbf{g})} g_i\,\mathrm{FFN}_i(\mathbf{x}), \qquad \mathbf{g} = \mathrm{softmax}(W_r\mathbf{x})$$

  • Activated params per token: $\approx k/E$ of total expert params → same inference cost as a smaller dense model.
  • Load balancing loss encourages uniform expert utilisation: $\mathcal{L}_{\mathrm{balance}} = E\sum_i f_i P_i$, where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the mean router probability for expert $i$.

Used in Mixtral, Switch Transformer, and (reportedly) GPT-4.


Key LLM Concepts Summary

| Concept | What it does | Key formula / idea |
| --- | --- | --- |
| Tokenisation (BPE) | Text → integer IDs | Merge most-frequent pairs |
| Causal Attention | Each token sees only past | Mask $M_{ij} = -\infty$ for $j > i$ |
| Scaling Law | Predict loss from $N, D, C$ | $L = E + A/N^\alpha + B/D^\beta$ |
| SFT | Align model to instructions | Cross-entropy on demonstrations |
| RLHF / DPO | Align to human preferences | Reward signal or preference pairs |
| CoT Prompting | Elicit step-by-step reasoning | $p(r, y \mid x) = p(r \mid x)\,p(y \mid x, r)$ |
| RAG | Ground generation in facts | Retrieve then generate |
| LoRA | Parameter-efficient fine-tuning | $W' = W + BA$, $r \ll d$ |

Multimodal AI

Multimodal models process and generate more than one modality (text, image, audio, video) within a unified framework.

Modality Encoding

Each modality is first encoded into a shared embedding space $\mathbb{R}^d$:

| Modality | Encoder | Output |
| --- | --- | --- |
| Text | Tokeniser + Embedding | Token embeddings |
| Image | ViT / CNN patch encoder | Patch embeddings |
| Audio | Spectrogram + Conv / Whisper | Frame embeddings |
| Video | Frame-level ViT + temporal attention | Spatio-temporal embeddings |

CLIP — Contrastive Vision-Language Pre-training

Learn aligned image and text embeddings by maximising agreement between matched pairs.

Given a batch of $N$ (image $I_i$, text $T_i$) pairs, compute temperature-scaled cosine similarities:

$$s_{ij} = \frac{1}{\tau}\cdot\frac{f(I_i) \cdot g(T_j)}{\|f(I_i)\|\,\|g(T_j)\|}$$

Symmetric InfoNCE loss (maximise diagonal, minimise off-diagonal):

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{\mathrm{img}\to\mathrm{txt}} + \mathcal{L}_{\mathrm{txt}\to\mathrm{img}}\right), \qquad \mathcal{L}_{\mathrm{img}\to\mathrm{txt}} = -\frac{1}{N}\sum_i\log\frac{e^{s_{ii}}}{\sum_j e^{s_{ij}}}$$
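A numpy sketch of the symmetric loss on a toy batch (embedding dims, the temperature, and the noise level are illustrative):

```python
import numpy as np

def clip_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # cosine similarity / τ
    idx = np.arange(len(logits))              # matched pairs sit on the diagonal

    def ce(l):
        """Cross-entropy with the diagonal entry as the target class."""
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    return 0.5 * (ce(logits) + ce(logits.T))  # image→text + text→image

rng = np.random.default_rng(0)
imgs = rng.normal(size=(8, 32))
loss_aligned = clip_infonce(imgs, imgs + 0.01 * rng.normal(size=(8, 32)))
loss_random = clip_infonce(imgs, rng.normal(size=(8, 32)))
print(bool(loss_aligned < loss_random))  # → True: matched pairs get low loss
```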

At test time: zero-shot classification by picking the text prompt with highest cosine similarity to the image.


Vision-Language Models (VLM)

Connect a vision encoder to an LLM via a projection layer.

Architecture:

$$\mathbf{v} = \mathrm{Proj}\!\left(\mathrm{ViT}(I)\right) \in \mathbb{R}^{N \times d_{\mathrm{LLM}}}$$

Visual tokens $\mathbf{v}$ are prepended (or interleaved) with text tokens and fed into the LLM:

$$p(y \mid I, x) = \prod_t p_\theta(y_t \mid \mathbf{v}, x, y_{<t})$$

LLaVA-style training — two stages:

  1. Pre-train the projection only (freeze encoder + LLM): learn the vision→language mapping
  2. Instruction fine-tuning: unfreeze the LLM, train on (image, instruction, response) triplets

Objective — standard next-token prediction on response tokens only:

$$\mathcal{L} = -\sum_t\log p_\theta(y_t \mid \mathbf{v}, x, y_{<t})$$


Cross-Modal Attention

Allow one modality to attend over another. Text queries attend to visual keys/values:

$$\mathrm{CrossAttn}(Q_{\mathrm{txt}}, K_{\mathrm{img}}, V_{\mathrm{img}}) = \mathrm{softmax}\!\left(\frac{Q_{\mathrm{txt}}K_{\mathrm{img}}^\top}{\sqrt{d_k}}\right)V_{\mathrm{img}}$$

Used in Flamingo, Perceiver Resampler, etc.


Image Generation — Text-to-Image

Condition a diffusion model on a text embedding $\mathbf{c}$.

Classifier-Free Guidance (CFG) — blend conditional and unconditional score:

$$\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing) + w\left(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing)\right)$$

  • $w > 1$: stronger text conditioning (higher fidelity to prompt, less diversity).
  • $w = 1$: standard conditional generation.
  • $w = 0$: unconditional generation.

Latent Diffusion (Stable Diffusion) — run diffusion in a compressed latent space:

$$\mathbf{z} = \mathcal{E}(\mathbf{x}) \;\to\; \text{diffuse in } \mathbf{z}\text{-space} \;\to\; \mathbf{x} = \mathcal{D}(\mathbf{z})$$


Multimodal Summary

| Model / Concept | Modalities | Key idea |
| --- | --- | --- |
| CLIP | Image + Text | Contrastive alignment, InfoNCE |
| LLaVA / InternVL | Image + Text | Visual tokens → LLM via projection |
| Flamingo | Image + Text | Cross-modal attention layers |
| Stable Diffusion | Text → Image | Latent diffusion + CFG |
| Whisper | Audio → Text | Spectrogram encoder + decoder |
| GPT-4o | Image/Audio/Text | Unified multimodal autoregressive LLM |