Section 01

What is pre-training?

Learning from raw text, no labels required

When people say a model “knows” things (that Paris is in France, that Python lists are zero-indexed, that a sonnet has fourteen lines), almost all of that knowledge was installed during one process: pre-training. It is the first phase of building a large language model, an enormously expensive one, and the one this entire explainer is about.

The premise is almost suspiciously simple. Take a staggering amount of text (much of the public web, a large fraction of the world’s books, a great deal of code) and train a neural network to do one thing: predict the next word (strictly, the next token, which is a word or a fragment of one). Do that at enough scale, for long enough, and the network is forced to learn grammar, facts, reasoning patterns, translation, arithmetic, and code, because all of those are useful for guessing what comes next. No one ever tells it the rules of French or the syntax of Python. It infers them, because inferring them lowers its prediction error.

Three phases: pre-training, post-training, inference

It helps to place pre-training next to its neighbors.

Pre-training learns a general-purpose foundation model from raw text. This is where the parameters — the billions of numbers that are the model — get their values. It can take months on tens of thousands of GPUs.
Post-training (supervised fine-tuning, reinforcement learning, alignment) then shapes that foundation into a helpful assistant that follows instructions and refuses harmful requests. It can be substantial in its own right, since modern reinforcement-learning pipelines are far from trivial, but it reshapes behavior rather than rebuilding the model’s core knowledge.
Inference is running the finished model to answer your prompts.

The split matters because the phases do different jobs. Pre-training is hugely expensive (a frontier run can cost tens of millions of dollars), and it defines what the model knows. Post-training shapes how that knowledge is expressed: following instructions, refusing harmful requests, reasoning step by step. That is why a new “frontier model” is news: somebody ran pre-training again, bigger or better, and the foundation moved.
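To see how “months on tens of thousands of GPUs” turns into a bill in the tens of millions, here is a back-of-the-envelope sketch. Every input (the GPU count, the run length, the hourly price) is a hypothetical round number chosen for illustration, not a figure from any real training run, and the function name is our own.

```python
def training_cost_usd(num_gpus, days, usd_per_gpu_hour):
    """Rough rental-style cost: GPU-hours times an hourly price."""
    gpu_hours = num_gpus * days * 24
    return gpu_hours * usd_per_gpu_hour

# Hypothetical inputs, for illustration only.
cost = training_cost_usd(num_gpus=16_000, days=60, usd_per_gpu_hour=2.0)
print(f"${cost / 1e6:.1f}M")  # prints "$46.1M"
```

The sketch ignores failed runs, experiments, storage, networking, and staff, so it is best read as a rough floor for the order of magnitude rather than an estimate.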

Why “self-supervised” is the whole trick

Classic machine learning is largely supervised: you need labeled examples, such as photos tagged “cat” or “dog,” or sentences tagged with their sentiment. Labels are made by humans, so they are scarce and expensive. You will never hand-label a trillion examples.

Next-token prediction sidesteps this entirely. Given the text “the cat sat on the”, the “label” for what comes next is just the actual next word in the document: “mat”. The data labels itself. This is self-supervised learning, and it is the reason pre-training can consume trillions of tokens: every sentence ever written is already a pile of free training examples, one per position.
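A minimal sketch of how text labels itself. It assumes whitespace-separated words stand in for tokens (real systems use a learned subword tokenizer), and the helper name `next_token_examples` is ours, not part of any library.

```python
def next_token_examples(text):
    """Turn raw text into (context, target) pairs, one per position."""
    # Assumption: splitting on whitespace stands in for a real tokenizer.
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

examples = next_token_examples("the cat sat on the mat")
for context, target in examples:
    print(" ".join(context), "->", target)
```

Six words yield five training examples, the last of which is exactly the “the cat sat on the” to “mat” pair from the text. No human labeled anything; the targets came from the document itself.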

Why scale works at all

The bet underneath modern AI is that prediction is understanding in disguise. To predict the next token of a murder mystery’s final page, it helps to have tracked the plot. To predict the next line of a proof, it helps to have learned the math. To complete a function, it helps to know the API. The pressure to predict well, applied across a broad enough corpus, pushes the network to build internal machinery that looks a lot like knowledge and reasoning.

For a long time it was not obvious this would keep paying off. It did — and remarkably smoothly. Bigger models trained on more data get predictably better, following clean mathematical curves we will meet in the scaling-laws chapters. That predictability is what justified spending ever-larger sums: you could forecast the payoff before you paid.
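To make “forecast the payoff before you paid” concrete, here is a toy sketch. It assumes loss follows a power law in parameter count plus an irreducible floor, which is the general shape scaling-law papers report. The constants are invented for illustration, the floor is assumed to be known in advance (in practice it has to be fitted too), and the function names are ours.

```python
import math

IRREDUCIBLE = 1.5  # hypothetical loss floor, assumed known for this sketch

def power_law_loss(n_params, a=10.0, alpha=0.1):
    """Hypothetical curve: L(N) = a * N**(-alpha) + IRREDUCIBLE."""
    return a * n_params ** (-alpha) + IRREDUCIBLE

def fit_log_log(xs, ys):
    """Least-squares line through (log x, log y); returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    return slope, my - slope * mx

# "Train" three small models, then extrapolate to one 100x larger.
small_models = [1e7, 1e8, 1e9]
reducible = [power_law_loss(n) - IRREDUCIBLE for n in small_models]
slope, intercept = fit_log_log(small_models, reducible)

predicted = math.exp(intercept + slope * math.log(1e11)) + IRREDUCIBLE
actual = power_law_loss(1e11)
print(f"fitted exponent: {-slope:.3f}")
print(f"predicted {predicted:.4f} vs actual {actual:.4f}")
```

Because a power law is a straight line on log-log axes, three cheap points pin down the line and the expensive point falls on it. Real measurements are noisy and the fit is never this clean, but the budgeting logic is the same.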

A pre-trained model is valued precisely because it generalizes, performing well on downstream tasks and text it never saw, rather than memorizing its corpus. (Memorization, or overfitting, is a failure mode we mostly dodge by training roughly one pass over a deduplicated dataset, as we will see.)
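As a rough illustration of what “deduplicated” can mean, the sketch below drops exact duplicates after light normalization. This is a simplification: production pipelines typically also remove near-duplicates with fuzzy methods such as MinHash, and the normalization rule and function names here are our own assumptions.

```python
import hashlib

def normalize(doc):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(doc.lower().split())

def deduplicate(docs):
    """Keep the first occurrence of each document, by normalized hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "The cat sat on the mat.",
    "the  cat sat on the mat.",  # same text, different case and spacing
    "A different document.",
]
unique_docs = deduplicate(corpus)
print(len(corpus), "->", len(unique_docs))  # prints "3 -> 2"
```

Storing hashes rather than full documents keeps the memory cost per document fixed, which matters when the corpus holds billions of pages.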

What the rest of this explainer covers

We build up in layers:

Foundations (you are here): the objective, the gradient that optimizes it, the optimizer, numerical precision, the GPU memory budget, parallelism, and the data pipeline. This is the machinery every model shares.
The transformer and the paradigm: the papers that invented the architecture and the pre-train-then-adapt recipe — Attention Is All You Need, GPT-1, BERT, GPT-2, T5.
Scaling laws: how the field learned to predict and budget training — Kaplan, GPT-3, Chinchilla.
The modern era: what today’s open models actually do differently — Llama 3, DeepSeek-V3, Qwen2.5, the Gemmas, synthetic data.
The 2026 frontier: the latest reports, focusing only on what is genuinely new.

Throughout, we care about both halves of the craft: the machine-learning science and the systems engineering — the FLOPs, the bytes, the data choices — that turn an equation into a real training run. Let’s start with the equation.