The objective
Next-token prediction and cross-entropy loss
Everything a language model learns in pre-training is squeezed through one narrow objective: predict the next token. This chapter makes that objective precise — what the model outputs, how we score it, and why this particular scoring rule is the right one.
The model is a next-token probability machine
A language model takes a sequence of tokens and outputs a probability distribution over what the next token will be. Concretely, given a context $x_{<t} = (x_1, \dots, x_{t-1})$, it produces
$$p_\theta(x_t \mid x_{<t}),$$
a number for every entry in the vocabulary (often 100k–256k entries), all non-negative and summing to 1. The subscript $\theta$ (the Greek letter theta) is the model’s parameters, the thing training adjusts. Internally the network emits one raw score (a logit) per vocabulary entry, and a softmax turns those logits into the probability distribution.
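A minimal sketch of the logits-to-probabilities step, using only the Python standard library. The four-word vocabulary and the logit values are made up for illustration; a real model produces one logit per vocabulary entry.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary and made-up logits, for illustration only.
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

for token, p in zip(vocab, probs):
    print(f"{token:>4}: {p:.3f}")
print("sum =", sum(probs))
```

Logits can be any real numbers, positive or negative; only their differences matter, which is why shifting them all by the maximum is safe.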
Because each token is predicted only from the tokens before it, this is a causal language model: information flows strictly left to right. The probability of a whole document factorizes into a product of next-token probabilities:
$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}).$$
This is the autoregressive factorization. It is exact (just the chain rule of probability), and it is what makes next-token prediction a complete model of text rather than a heuristic.
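The chain-rule factorization can be checked on a toy example. The sketch below assumes a hypothetical two-token vocabulary and a model whose next-token distribution depends only on the previous token (a real language model conditions on the whole context); the probability table is invented for illustration.

```python
import math
from itertools import product

# Hypothetical toy model: next-token probabilities keyed by the previous token.
START = "<s>"
next_token_probs = {
    START: {"a": 0.6, "b": 0.4},
    "a":   {"a": 0.1, "b": 0.9},
    "b":   {"a": 0.5, "b": 0.5},
}

def sequence_log_prob(tokens):
    """Sum of log next-token probabilities: the log of the chain-rule product."""
    total = 0.0
    prev = START
    for tok in tokens:
        total += math.log(next_token_probs[prev][tok])
        prev = tok
    return total

seq = ["a", "b", "b"]
log_p = sequence_log_prob(seq)
print("p(a, b, b) =", math.exp(log_p))  # product 0.6 * 0.9 * 0.5

# Because every conditional distribution sums to 1, the probabilities of all
# length-3 sequences also sum to 1: the factorization defines a valid model.
total = sum(math.exp(sequence_log_prob(s)) for s in product("ab", repeat=3))
print("sum over all length-3 sequences =", total)
```

Working in log space turns the product into a sum, which is also what keeps the computation numerically stable for long documents.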
Scoring a prediction: cross-entropy
We need a loss function that is small when the model put high probability on the token that actually came next, and large when it did not. The natural choice is the cross-entropy loss, equivalently the negative log-likelihood of the correct token:
$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).$$
Reading it term by term:
- $\mathcal{L}(\theta)$: the loss, the single number we want to make as small as possible. Lower means the model predicted the real text better.
- $x_t$: the actual token at position $t$ (the ground truth sitting in the training document), out of the whole sequence $x_1, \dots, x_T$.
- $x_{<t}$: shorthand for all the tokens before position $t$, i.e. $x_1, \dots, x_{t-1}$. This is the context the model is allowed to look at.
- $p_\theta(x_t \mid x_{<t})$: the probability the model assigned to the true token $x_t$, given that context. The bar reads “given.” This is a single number between 0 and 1, the one value we care about out of the model’s full distribution over the vocabulary.
- $\theta$: the model’s parameters (its billions of weights). The subscript is a reminder that this probability depends on the current model; training changes $\theta$ to make the true tokens more probable.
- $\log$: the (natural) logarithm. Applied to a probability in $(0, 1]$ it gives a number in $(-\infty, 0]$: $\log 1 = 0$ (perfect), and it dives toward $-\infty$ as the probability approaches $0$.
- $\sum_{t=1}^{T}$: sum over every position from $t = 1$ to $T$, so every token in the sequence contributes one term.
- $T$: the number of tokens in the sequence (its length).
- $\frac{1}{T}$: divide by $T$ to turn that sum into an average per token, so the loss doesn’t simply grow with sequence length and is comparable across sequences.
- the leading $-$ (minus sign): flips the sign. Since each $\log p_\theta(x_t \mid x_{<t})$ is negative, negating makes $\mathcal{L}$ a positive quantity where smaller is better, a penalty rather than a reward.
Put together: for each position we look up the probability the model assigned to the true next token, take its logarithm, negate it, and average over all positions. Maximizing the likelihood of the data is the same as minimizing this loss.
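The recipe above can be written in a few lines. This is a sketch under the assumption that we already have, for each position, the probability the model gave to the true token; the four values are made up.

```python
import math

def cross_entropy(true_token_probs):
    """Average negative log-probability of the true tokens, in nats."""
    T = len(true_token_probs)
    return -sum(math.log(p) for p in true_token_probs) / T

# Made-up probabilities the model assigned to the true token at 4 positions.
probs_of_truth = [0.5, 0.9, 0.1, 0.25]
loss = cross_entropy(probs_of_truth)
print(f"loss = {loss:.4f} nats")

# The same number, computed from the likelihood of the whole sequence:
# minimizing the loss and maximizing the likelihood are the same thing.
likelihood = math.prod(probs_of_truth)
print(f"-log(likelihood) / T = {-math.log(likelihood) / len(probs_of_truth):.4f} nats")
```

Note how the single poorly predicted position (probability 0.1) contributes more to the total than the other three combined.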
Why the logarithm? Because it makes the penalty grow without bound as the model’s probability for the truth approaches zero. A model that says “1% chance” for the token that actually appears pays $-\log 0.01 \approx 4.6$ nats; one that says “0.001%” pays $-\log 0.00001 \approx 11.5$ nats. Cross-entropy punishes confident mistakes savagely and rewards honest uncertainty mildly: exactly the incentive you want.
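To get a feel for how the penalty scales, the short sketch below prints the per-token loss for a few illustrative probabilities (the helper name `penalty` is ours, not a library function).

```python
import math

def penalty(p):
    """Per-token loss in nats when the true token was given probability p."""
    return -math.log(p)

for p in [0.5, 0.1, 0.01, 0.00001]:
    print(f"p = {p:<8g} penalty = {penalty(p):.2f} nats")
```

Each factor-of-10 drop in the probability assigned to the truth adds the same fixed amount, about 2.3 nats, so there is no ceiling on how costly a confident mistake can be.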
Perplexity and bits-per-token: the same loss in friendlier units
The raw loss is in nats (natural-log units), which is hard to feel. Two reparametrizations make it intuitive.
- Perplexity is $\exp(\mathcal{L})$. You can read it as “the model is as confused as if it were choosing uniformly among this many tokens.” A perplexity of 1 is perfect; early GPT-2-scale models on web text were in the low tens; today’s best are lower still.
- Bits-per-token is $\mathcal{L} / \ln 2$, the loss in base-2. This connects pre-training to compression: a better language model assigns shorter codes to real text. Training to minimize bits-per-token is, quite literally, learning to compress the corpus, and good compression demands understanding.
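Both conversions are one-liners. The sketch below also runs a sanity check on a model that guesses uniformly over a vocabulary; the vocabulary size of 50,000 is a hypothetical value chosen for illustration.

```python
import math

def perplexity(loss_nats):
    """exp of the cross-entropy loss."""
    return math.exp(loss_nats)

def bits_per_token(loss_nats):
    """Cross-entropy loss converted from nats to bits."""
    return loss_nats / math.log(2)

# A model that guesses uniformly over V tokens assigns probability 1/V to the
# truth, so its loss is log(V) and its perplexity is V itself.
V = 50_000  # hypothetical vocabulary size
uniform_loss = math.log(V)
print(f"loss           = {uniform_loss:.3f} nats")
print(f"perplexity     = {perplexity(uniform_loss):.1f}")
print(f"bits per token = {bits_per_token(uniform_loss):.3f}")
```

The uniform case is a useful baseline: a freshly initialized model should start near this loss, and anything below it reflects what training has learned.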
Teacher forcing: a trillion examples at once
There is one more efficiency trick baked into the objective. During training we don’t generate text and then grade it; we feed the model the true sequence and ask it, at every position simultaneously, to predict the next token. This is teacher forcing. Thanks to the causal mask (each position can only see earlier ones), a single forward pass over a length-$T$ sequence produces $T$ next-token predictions and therefore $T$ loss terms, all in parallel.
This is why a document is worth its length in training signal, and why the context length we train at (say 4k or 8k tokens) directly sets how much supervision each sequence yields. We will see context length become a first-class design knob in the modern models.
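In code, teacher forcing amounts to shifting the sequence by one position. The sketch below assumes a start-of-sequence token (the ID `BOS = 0` and the token IDs are hypothetical) is prepended so that the first real token also gets a prediction; without one, a length-$T$ sequence yields $T - 1$ input/target pairs instead of $T$.

```python
BOS = 0  # hypothetical start-of-sequence token ID

def teacher_forcing_pairs(token_ids):
    """Return (inputs, targets): targets[t] is the true token the model
    must predict after seeing inputs[0..t]."""
    inputs = [BOS] + token_ids[:-1]  # the true sequence, shifted right by one
    targets = list(token_ids)        # the true token at every position
    return inputs, targets

token_ids = [17, 4, 92, 8]  # made-up token IDs
inputs, targets = teacher_forcing_pairs(token_ids)

for t, target in enumerate(targets):
    context = inputs[: t + 1]  # what the causal mask lets position t see
    print(f"context={context} -> target={target}")
```

The loop is only for display: in a real training step all of these predictions come out of one forward pass, and the causal mask is what guarantees position $t$ never sees its own target.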
That is the entire objective. Simple to state, and — given enough scale — astonishingly powerful. The next question is mechanical: given this loss, how do we actually change billions of parameters to reduce it? That is gradient descent and backpropagation.