Section 01

What is pre-training?

Learning from raw text, no labels required

When people say a model “knows” things (that Paris is in France, that Python lists are zero-indexed, that a sonnet has fourteen lines), almost all of that knowledge was installed during one process: pre-training. It is the first phase of building a large language model (LLM), an enormously expensive one, and the one this entire explainer is about.

The premise is almost suspiciously simple. Take a staggering amount of text (most of the public web, a large fraction of the world’s books, a great deal of code) and train a neural network to do one thing: predict the next word (strictly, the next token). Do that at enough scale, for long enough, and the network is forced to learn grammar, facts, reasoning patterns, translation, arithmetic, and code, because all of those are useful for guessing what comes next. No one ever tells it the rules of French or the syntax of Python. It infers them, because inferring them lowers its prediction error.

Three phases: pre-training, post-training, inference

It helps to place pre-training next to its neighbors.

  • Pre-training learns a general-purpose foundation model from raw text. This is where the parameters (the billions of numbers that are the model) get their values. It can take months on tens of thousands of GPUs.
  • Post-training (supervised fine-tuning, reinforcement learning, alignment) then shapes that foundation into a helpful assistant that follows instructions and refuses harmful requests. It can be substantial in its own right, since modern reinforcement-learning pipelines are far from trivial, but it reshapes behavior rather than rebuilding the model’s core knowledge.
  • Inference is running the finished model to answer your prompts.

The split matters because the phases do fundamentally different jobs. Pre-training is hugely expensive (a frontier run can cost tens of millions of dollars), and it defines what the model knows. Post-training shapes how that knowledge is expressed: following instructions, refusing harmful requests, reasoning step by step. That is why a new “frontier model” is news: somebody ran pre-training again, bigger or better, and the foundation moved.

Why “self-supervised” is the whole trick

Classic machine learning is supervised: you need labeled examples — photos tagged “cat” or “dog,” sentences tagged with their sentiment. Labels are made by humans, so they are scarce and expensive. You will never hand-label a trillion examples.

Next-token prediction sidesteps this entirely. Given the text “the cat sat on the”, the “label” for what comes next is just the actual next word in the document: “mat”. The data labels itself. This is self-supervised learning, and it is the reason pre-training can consume trillions of tokens: every sentence ever written is already a pile of free training examples, one per position.
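To make “the data labels itself” concrete, here is a minimal Python sketch of how one sentence becomes several training examples. It assumes whitespace splitting as a stand-in for a real tokenizer, and the helper name `next_token_examples` is hypothetical, chosen only for this illustration.

```python
def next_token_examples(text):
    """Turn raw text into (context, target) pairs, one per position.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    tokens = text.split()
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[:i]  # everything seen so far
        target = tokens[i]    # the "label" is simply the next token
        examples.append((context, target))
    return examples


pairs = next_token_examples("the cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A six-word sentence yields five examples, and the last one is exactly the case above: the context “the cat sat on the” with the target “mat”. No human wrote any of those labels.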

Why scale works at all

The bet underneath modern AI is that prediction is understanding in disguise. To predict the next token of a murder mystery’s final page, it helps to have tracked the plot. To predict the next line of a proof, it helps to have learned the math. To complete a function, it helps to know the API. The pressure to predict well, applied across a broad enough corpus, pushes the network to build internal machinery that looks a lot like knowledge and reasoning.

For a long time it was not obvious this would keep paying off. It did — and remarkably smoothly. Bigger models trained on more data get predictably better, following clean mathematical curves we will meet in the scaling-laws chapters. That predictability is what justified spending ever-larger sums: you could forecast the payoff before you paid.
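The idea of forecasting the payoff before paying can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the loss is taken to follow a pure power law in model size, the “measurements” are synthetic numbers generated from made-up constants (`TRUE_A`, `TRUE_ALPHA`) rather than real results, and `fit_power_law` is a hypothetical helper. The curves actually used in practice are the subject of the scaling-laws chapters.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha) in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


# Synthetic stand-ins for losses measured on small, cheap runs (not real data).
TRUE_A, TRUE_ALPHA = 10.0, 0.1
small_sizes = [1e6, 1e7, 1e8]
small_losses = [TRUE_A * s ** (-TRUE_ALPHA) for s in small_sizes]

# Fit on the small runs, then extrapolate to a model 100x larger than any of them.
a, alpha = fit_power_law(small_sizes, small_losses)
forecast = a * (1e10) ** (-alpha)
print(f"fitted alpha = {alpha:.3f}, forecast loss at 1e10 parameters = {forecast:.3f}")
```

A power law is a straight line on log-log axes, so three cheap runs are enough to draw the line and read off a prediction for a run that has not happened yet.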

A pre-trained model is valued precisely because it generalizes: it performs well on downstream tasks and text it never saw, rather than memorizing its corpus. (Memorization, or overfitting, is a failure mode we mostly dodge by training roughly one pass over a deduplicated dataset, as we will see.)
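As a rough picture of what “deduplicated” means, the sketch below drops exact repeats by hashing each document after light normalization. It is an assumption-laden toy: the helper name `deduplicate` and the normalization rule (lowercasing and collapsing whitespace) are chosen here for illustration, and real pipelines typically also hunt for near-duplicates, which this does not attempt.

```python
import hashlib


def deduplicate(documents):
    """Keep the first copy of each document, comparing after light normalization."""
    seen = set()
    unique = []
    for doc in documents:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["The cat sat.", "the  cat sat.", "A different page."]
unique_docs = deduplicate(docs)
print(unique_docs)
```

Storing a fixed-size hash instead of the full text keeps the `seen` set small, which matters once the corpus is far larger than this three-document example.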

What the rest of this explainer covers

We build up in layers:

  1. Foundations (you are here): the objective, the gradient that optimizes it, the optimizer, numerical precision, the GPU memory budget, parallelism, and the data pipeline. This is the machinery every model shares.
  2. The transformer and the paradigm: the papers that invented the architecture and the pre-train-then-adapt recipe — Attention Is All You Need, GPT-1, BERT, GPT-2, T5.
  3. Scaling laws: how the field learned to predict and budget training — Kaplan, GPT-3, Chinchilla.
  4. The modern era: what today’s open models actually do differently — Llama 3, DeepSeek-V3, Qwen2.5, the Gemmas, synthetic data.
  5. The 2026 frontier: the latest reports, focusing only on what is genuinely new.

Throughout, we care about both halves of the craft: the machine-learning science and the systems engineering — the FLOPs, the bytes, the data choices — that turn an equation into a real training run. Let’s start with the equation.