Related notes: COMPSCI 713, Transformers, Deep Learning Explained with Mathematics, Mathematical Foundations, Computational Graphics
Why AlphaFold Matters
AlphaFold is important because it solved a problem that sits at the boundary of biology, geometry, probability, and machine learning:
- proteins are written as 1D amino-acid sequences
- their function depends on a 3D folded structure
- predicting that 3D structure from sequence alone is hard
In other words, the model has to learn how a string becomes a shape.
That is why AlphaFold is a beautiful AI system to study. It is not just “another neural network.” It is a system that:
- represents sequence information
- represents pairwise geometric relationships
- uses attention to exchange information globally
- gradually refines a structure in 3D space
For a PyTorch learner, AlphaFold is a great example of how modern deep learning moves beyond plain classification and into structured reasoning.
Environment Setup
If you want to follow the examples locally, uv is a clean way to manage Python and virtual environments.
```shell
brew install uv
mkdir alphafold_intro
cd alphafold_intro
uv init .
uv add numpy matplotlib torch torchvision jupyterlab
```

The important thing is not the package manager itself. The important thing is that once the environment is stable, we can focus on tensors, shapes, and model logic.
The Core Learning Question
Let a protein sequence be

$$S = (s_1, s_2, \dots, s_n)$$

where each $s_i$ is an amino-acid token.

The target is a set of 3D coordinates

$$X = (x_1, x_2, \dots, x_n), \qquad x_i \in \mathbb{R}^3$$

for atoms or residues.
So the learning problem is to learn a mapping

$$f : S \longmapsto X$$
This sounds simple, but it is not. Why?
- residues far apart in sequence can be close in 3D
- local chemistry matters
- global consistency matters
- the output is geometric, not just categorical
That means the model must learn both content and relationships.
Tensor Intuition First
Before getting to AlphaFold, we need the right tensor mental model.
In PyTorch, a tensor is just a multidimensional numerical array with autograd support:
```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
```

The shape tells us how the data is organised.
- scalar: `()`
- vector: `(d,)`
- matrix: `(n, d)`
- pairwise table: `(n, n, d)`
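A quick way to internalise these shapes is to build one tensor of each and print its shape (the dimensions below are arbitrary illustrative values):

```python
import torch

scalar = torch.tensor(3.14)           # shape ()
vector = torch.randn(64)              # shape (d,) with d = 64
matrix = torch.randn(10, 64)          # shape (n, d)
pair_table = torch.randn(10, 10, 32)  # shape (n, n, d)

print(scalar.shape)      # torch.Size([])
print(vector.shape)      # torch.Size([64])
print(matrix.shape)      # torch.Size([10, 64])
print(pair_table.shape)  # torch.Size([10, 10, 32])
```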
That last one is especially important. AlphaFold does not only store features for each residue. It also stores features for each pair of residues.
Sequence Representation vs Pair Representation
The key AlphaFold idea is that we need two coupled views of the protein.
1. Sequence / residue representation
For each residue $i$, we keep a feature vector:

$$h_i \in \mathbb{R}^{d_{\text{seq}}}$$

Stacked together, this becomes:

$$H \in \mathbb{R}^{n \times d_{\text{seq}}}$$
This is similar to token embeddings in an LLM.
2. Pair representation
For each pair $(i, j)$, we keep a feature vector:

$$z_{ij} \in \mathbb{R}^{d_{\text{pair}}}$$

Stacked together:

$$Z \in \mathbb{R}^{n \times n \times d_{\text{pair}}}$$
This pair tensor is where AlphaFold becomes really interesting. It explicitly models whether two residues may be:
- close in space
- part of the same structural motif
- geometrically compatible
This is much richer than a plain sequence model.
A Small PyTorch Picture
If a protein has length n = 128, then a toy version might look like this:
```python
n = 128
d_seq = 256
d_pair = 128

seq_repr = torch.randn(n, d_seq)       # [n, d_seq]
pair_repr = torch.randn(n, n, d_pair)  # [n, n, d_pair]
```

Now we already see the computational challenge:
- the sequence representation scales like $O(n)$
- the pair representation scales like $O(n^2)$
That is one reason protein models become expensive quickly.
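To make the scaling concrete, here is a back-of-the-envelope memory estimate (illustrative dimensions, float32 at 4 bytes per element):

```python
def repr_bytes(n, d_seq=256, d_pair=128, bytes_per_el=4):
    """Memory of the sequence and pair representations, in bytes."""
    seq = n * d_seq * bytes_per_el        # O(n)
    pair = n * n * d_pair * bytes_per_el  # O(n^2)
    return seq, pair

for n in (128, 512, 2048):
    seq, pair = repr_bytes(n)
    print(f"n={n}: seq={seq / 1e6:.2f} MB, pair={pair / 1e6:.2f} MB")
```

Doubling the protein length quadruples the pair tensor, which is exactly why long proteins get expensive.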
Why Attention Helps
Classical recurrent models struggle with long-range dependency. Proteins are full of long-range dependency.
A residue near the beginning of the sequence may interact strongly with one near the end. Attention is useful because it lets each position read from all others.
For a sequence representation $H \in \mathbb{R}^{n \times d}$, queries, keys, and values are linear projections:

$$Q = HW_Q, \quad K = HW_K, \quad V = HW_V$$

and attention weights are:

$$A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)$$
This gives global communication across residues.
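As a sketch, standard scaled dot-product attention over residue features looks like this (random weights stand in for learned projections):

```python
import torch
import torch.nn.functional as F

n, d = 16, 32
x = torch.randn(n, d)  # residue features [n, d]

# Random matrices stand in for the learned projections W_Q, W_K, W_V
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention weights: each residue attends to every other residue
A = F.softmax(Q @ K.T / d ** 0.5, dim=-1)  # [n, n], rows sum to 1
out = A @ V                                # [n, d] globally mixed features

print(A.shape, out.shape)
```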
But AlphaFold goes further. It uses information from pair features to bias or guide sequence attention. So the model is not only asking:
“Which tokens are relevant?”
It is also asking:
“Which residues are likely geometrically related?”
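One minimal way to sketch that idea is to project each pair feature down to a scalar and add it to the attention logits. The projection `w_bias` below is my own illustrative stand-in, not the real AlphaFold parameterisation:

```python
import torch
import torch.nn.functional as F

n, d, d_pair = 16, 32, 8
x = torch.randn(n, d)             # residue features [n, d]
pair = torch.randn(n, n, d_pair)  # pair features [n, n, d_pair]

W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
w_bias = torch.randn(d_pair)      # maps each pair feature to a scalar bias

Q, K, V = x @ W_q, x @ W_k, x @ W_v
bias = pair @ w_bias              # [n, n] geometric bias on the logits

# Pair features now steer which residues attend to which
A = F.softmax(Q @ K.T / d ** 0.5 + bias, dim=-1)
out = A @ V
print(out.shape)
```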
AlphaFold as a Message-Passing System
One good high-level mental model is:
```mermaid
flowchart LR
    A[Sequence Tokens] --> B[Residue Representation]
    A --> C[Pair Initialization]
    B --> D[Evoformer Updates]
    C --> D
    D --> E[Structure Module]
    E --> F[3D Coordinates]
    F --> G[Refinement / Confidence]
```
The important thing here is not the exact block names. It is the feedback loop:
- sequence updates pair
- pair updates sequence
- both feed structure prediction
That is why AlphaFold feels more like a system than a single layer stack.
Multiple Sequence Alignments
A huge part of AlphaFold’s power comes from evolutionary information.
If we collect related protein sequences across species, we get a multiple sequence alignment (MSA). The rough idea is:
- residues that change together may be structurally linked
- conserved positions often matter functionally
- co-evolution provides indirect geometric evidence
So AlphaFold does not only read one sequence. It often reads a whole family of related sequences and tries to detect correlated variation.
That is conceptually similar to saying:
“The data distribution itself contains clues about structure.”
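A toy calculation can make co-variation concrete. Below, a hand-made integer "MSA" has two columns engineered to vary together; plain column correlation (a much cruder signal than the couplings real methods extract) picks that out:

```python
import numpy as np

# A toy "MSA": 6 aligned sequences, 5 columns, tokens as integers.
# Columns 1 and 3 are constructed to vary together; column 0 is conserved.
msa = np.array([
    [0, 1, 2, 1, 3],
    [0, 2, 2, 2, 1],
    [0, 1, 0, 1, 2],
    [0, 2, 1, 2, 0],
    [0, 1, 3, 1, 1],
    [0, 2, 2, 2, 3],
])

# Crude co-variation score: absolute correlation between columns
centered = msa - msa.mean(axis=0)
cov = centered.T @ centered
std = np.sqrt(np.diag(cov))
with np.errstate(invalid="ignore", divide="ignore"):
    corr = cov / np.outer(std, std)
corr = np.nan_to_num(corr)  # conserved columns have zero variance

print(np.round(np.abs(corr), 2))  # columns 1 and 3 stand out
```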
The Evoformer Intuition
You do not need to memorise every block to understand the architecture at a beginner level.
The Evoformer is basically a repeated refinement stage for:
- MSA / sequence features
- pairwise relation features
It does three important things:
- lets residues attend to each other
- lets pair features accumulate geometric hints
- lets sequence and pair states exchange information repeatedly
The repeated refinement matters. One pass is usually not enough to infer a globally consistent fold.
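The alternating update pattern can be sketched as a tiny block that is applied repeatedly. This is an illustrative skeleton of the information flow, not the real Evoformer:

```python
import torch
import torch.nn as nn

class TinyRefineBlock(nn.Module):
    """Illustrative only: one round of sequence <-> pair exchange."""
    def __init__(self, d_seq=32, d_pair=16):
        super().__init__()
        self.seq_update = nn.Linear(d_seq + d_pair, d_seq)
        self.pair_update = nn.Linear(d_pair + 2 * d_seq, d_pair)

    def forward(self, seq, pair):
        n = seq.size(0)
        # Pair -> sequence: summarise each residue's row of pair features
        pair_summary = pair.mean(dim=1)  # [n, d_pair]
        seq = seq + self.seq_update(torch.cat([seq, pair_summary], dim=-1))
        # Sequence -> pair: broadcast residue features to every pair (i, j)
        si = seq.unsqueeze(1).expand(n, n, -1)
        sj = seq.unsqueeze(0).expand(n, n, -1)
        pair = pair + self.pair_update(torch.cat([pair, si, sj], dim=-1))
        return seq, pair

seq, pair = torch.randn(8, 32), torch.randn(8, 8, 16)
block = TinyRefineBlock()
for _ in range(4):  # repeated refinement, as in the Evoformer stack
    seq, pair = block(seq, pair)
print(seq.shape, pair.shape)
```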
Triangle Reasoning
One famous AlphaFold idea is triangle-style updates on pairs.
Why triangles?
If residue $i$ relates to residue $k$, and $k$ relates to $j$, then that provides evidence about the pair $(i, j)$.
This is a form of relational reasoning. In plain language:
- pairwise geometry should be consistent
- consistency often emerges through triples
That is why pair updates are not just local MLPs. They are trying to enforce something more like geometric compatibility.
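A triangle-multiplication-style update can be sketched in a few lines: evidence for pair $(i, j)$ is accumulated through every intermediate residue $k$ (random projections stand in for learned ones):

```python
import torch

n, d = 8, 16
z = torch.randn(n, n, d)   # pair representation [n, n, d]

# Two "edge" projections (random here; learned in the real model)
a = z @ torch.randn(d, d)  # features of edge (i, k)
b = z @ torch.randn(d, d)  # features of edge (j, k)

# Triangle update ("outgoing" flavour): combine evidence through every k
# update[i, j] = sum_k a[i, k] * b[j, k]
update = torch.einsum('ikd,jkd->ijd', a, b)

z = z + update             # residual refinement of the pair representation
print(z.shape)
```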
Structure Module
After enough relational refinement, the model must actually output coordinates.
At a high level, the structure module maps learned features into 3D transformations and positions: each residue $i$ is assigned a rigid-body frame

$$(R_i, t_i), \qquad R_i \in SO(3),\; t_i \in \mathbb{R}^3$$

from which atom coordinates are placed.
This stage is where representation learning becomes geometry.
A useful beginner intuition is:
- the earlier blocks learn “what relates to what”
- the structure block turns those relations into coordinates
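A heavily simplified sketch of that idea: a head reads residue features and repeatedly nudges per-residue positions (the real module also predicts rotations and feeds geometry back into the features):

```python
import torch
import torch.nn as nn

n, d = 8, 32
feats = torch.randn(n, d)       # refined residue features

coord_update = nn.Linear(d, 3)  # predicts a 3D displacement per residue
coords = torch.zeros(n, 3)      # start from an initial guess (the origin)

for _ in range(3):              # iterative refinement of positions
    coords = coords + coord_update(feats)

print(coords.shape)
```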
Confidence Prediction
AlphaFold is also valuable because it predicts confidence, not just structure.
This matters in science. A model should know when it may be wrong.
Common confidence ideas include:
- local confidence per residue
- global structure confidence
- predicted alignment / distance reliability
For ML students, this is a nice reminder that good systems often predict both:
- an answer
- a measure of trust
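As a sketch, a per-residue confidence head can sit right next to the coordinate head. The sigmoid-squashed score below is only an illustration of the idea, not AlphaFold's actual pLDDT parameterisation:

```python
import torch
import torch.nn as nn

n, d = 8, 32
feats = torch.randn(n, d)

coord_head = nn.Linear(d, 3)                              # the answer
conf_head = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())  # the trust, in [0, 1]

coords = coord_head(feats)                 # [n, 3]
confidence = conf_head(feats).squeeze(-1)  # [n], per-residue confidence

print(coords.shape, confidence.shape)
```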
A Minimal PyTorch Thought Experiment
Suppose we want a tiny educational prototype, not real AlphaFold. Then we might do:
```python
import torch
import torch.nn as nn

class ToyProteinModel(nn.Module):
    def __init__(self, vocab=21, d_seq=128, d_pair=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_seq)
        self.seq_proj = nn.Linear(d_seq, d_seq)
        self.pair_proj = nn.Linear(d_seq * 2, d_pair)
        self.coord_head = nn.Linear(d_seq, 3)

    def forward(self, tokens):
        x = self.embed(tokens)  # [n, d_seq]
        x = self.seq_proj(x)
        # Build the pair tensor from every (i, j) combination of residues
        xi = x.unsqueeze(1).expand(-1, x.size(0), -1)
        xj = x.unsqueeze(0).expand(x.size(0), -1, -1)
        pair = self.pair_proj(torch.cat([xi, xj], dim=-1))  # [n, n, d_pair]
        coords = self.coord_head(x)  # [n, 3]
        return x, pair, coords
```

This is not AlphaFold, but it teaches the right shape intuition:
- token embedding
- pair tensor construction
- coordinate head
That is enough to start seeing the design space.
The Main Difference from LLMs
AlphaFold and transformers are related, but they are not the same kind of system.
An LLM mainly predicts the next token:

$$p(s_{t+1} \mid s_1, \dots, s_t)$$

AlphaFold instead predicts a structured geometric object:

$$X \in \mathbb{R}^{n \times 3}$$
So the output space is different:
- LLMs output text distributions
- AlphaFold outputs geometry and confidence
That change in output space drives a completely different architecture emphasis.
What To Focus On As a Student
If you are learning AlphaFold for the first time, focus on these ideas first:
- sequence representation and pair representation are equally important
- attention handles long-range dependency
- evolutionary data adds powerful biological signal
- geometric consistency is a core modeling constraint
- the final target is a 3D structure, not a label
If those five ideas are clear, then the paper and implementation details become much easier to digest.
Closing Intuition
The deepest lesson from AlphaFold is not just “AI can predict proteins.”
It is that modern models work best when the representation matches the structure of the problem.
Proteins are not just sequences. They are:
- sequences
- pairwise interaction graphs
- 3D geometric objects
- evolutionary objects
AlphaFold succeeds because it respects all of those views at once.
That is exactly the kind of systems thinking worth learning from.