Section 02

The objective

Next-token prediction and cross-entropy loss

Everything a language model learns in pre-training is squeezed through one narrow objective: predict the next token. This chapter makes that objective precise — what the model outputs, how we score it, and why this particular scoring rule is the right one.

The model is a next-token probability machine

A language model takes a sequence of tokens and outputs a probability distribution over what the next token will be. Concretely, given a context $x_1, x_2, \ldots, x_{t-1}$, it produces

$$p_\theta(x_t \mid x_1, \ldots, x_{t-1})$$

a number for every entry in the vocabulary (often 100k–256k entries), all non-negative and summing to 1. The subscript $\theta$ (the Greek letter theta) is the model’s parameters, the quantities that training adjusts. Internally the network emits one raw score (a logit) per vocabulary entry, and a softmax turns those logits into the probability distribution.
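The logit-to-probability step can be sketched in a few lines. This is a minimal illustration, not how any particular framework implements it: the five-word vocabulary and the logit values are made up, and `softmax` is a hand-written helper.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into a probability distribution."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5-entry vocabulary and made-up logits, for illustration only.
vocab = ["mat", "sofa", "floor", "table", "roof"]
logits = [4.0, 1.5, 0.5, -0.5, -2.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token:>6}: {p:.4f}")
print("sum =", sum(probs))
```

Every output is non-negative and the outputs sum to 1 (up to floating-point rounding), whatever the logits are. A real model does the same thing over its full vocabulary.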

Because each token is predicted only from the tokens before it, this is a causal language model: information flows strictly left to right. The probability of a whole document factorizes into a product of next-token probabilities:

$$p_\theta(x_1, \ldots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})$$

This is the autoregressive factorization. It is exact — just the chain rule of probability — and it is what makes next-token prediction a complete model of text rather than a heuristic.
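To see the factorization at work, here is a small sketch with a hypothetical two-token vocabulary. The function `next_token_probs` is an invented stand-in for the model and, unlike a real language model, looks only at the last token of the context. The point is that multiplying next-token probabilities gives a proper distribution over whole sequences.

```python
import math
from itertools import product

VOCAB = ["a", "b"]

def next_token_probs(context):
    """Hypothetical toy model: depends only on the last token of the context.
    A real language model would condition on the entire prefix."""
    if not context:
        return {"a": 0.6, "b": 0.4}
    if context[-1] == "a":
        return {"a": 0.3, "b": 0.7}
    return {"a": 0.8, "b": 0.2}

def sequence_log_prob(tokens):
    """log p(x_1..x_T) = sum over t of log p(x_t | x_<t)."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(next_token_probs(tokens[:t])[token])
    return total

# Probability of one sequence: p(a) * p(b | a) * p(a | a, b) = 0.6 * 0.7 * 0.8
p_aba = math.exp(sequence_log_prob(["a", "b", "a"]))

# Summing the joint probability over all 2^3 sequences of length 3 gives 1.
total_mass = sum(math.exp(sequence_log_prob(list(seq)))
                 for seq in product(VOCAB, repeat=3))
print(p_aba, total_mass)
```

The sketch sums log-probabilities instead of multiplying probabilities, which avoids underflow on long sequences and is the form the loss below takes.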

Scoring a prediction: cross-entropy

We need a loss function that is small when the model put high probability on the token that actually came next, and large when it did not. The natural choice is the cross-entropy loss — equivalently the negative log-likelihood of the correct token:

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Reading it term by term:

$\mathcal{L}$: the loss, the single number we want to make as small as possible. Lower means the model predicted the real text better.
$x_t$: the actual token at position $t$ (the ground truth sitting in the training document), out of the whole sequence $x_1, x_2, \ldots, x_T$.
$x_{<t}$: shorthand for all the tokens before position $t$, i.e. $x_1, \ldots, x_{t-1}$. This is the context the model is allowed to look at.
$p_\theta(x_t \mid x_{<t})$: the probability the model assigned to the true token $x_t$, given that context. The bar $\mid$ reads “given.” This is a single number between 0 and 1, the one value we care about out of the model’s full distribution over the vocabulary.
$\theta$: the model’s parameters (its billions of weights). The subscript is a reminder that this probability depends on the current model; training changes $\theta$ to make the true tokens more probable.
$\log$: the (natural) logarithm. Applied to a probability in $(0,1]$ it gives a number in $(-\infty, 0]$: $\log 1 = 0$ (perfect), and it dives toward $-\infty$ as the probability approaches $0$.
$\sum_{t=1}^{T}$: sum over every position $t$ from $1$ to $T$, so every token in the sequence contributes one term.
$T$: the number of tokens in the sequence (its length).
$\tfrac{1}{T}$: divide by $T$ to turn that sum into an average per token, so the loss doesn’t simply grow with sequence length and is comparable across sequences.
the leading $-$ (minus sign): flips the sign. Since each $\log p$ is at most zero, negating makes $\mathcal{L}$ non-negative, with smaller being better: a penalty rather than a reward.

Put together: for each position we look up the probability the model assigned to the true next token, take its logarithm, negate it, and average over all positions. Maximizing the likelihood of the data is the same as minimizing this loss.
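The recipe above fits in one small function. The sketch below assumes we already have the probability the model gave the true token at each position. The values in `p_true` are invented, and `cross_entropy` is an illustrative helper, not a library call.

```python
import math

def cross_entropy(true_token_probs):
    """Average negative log-probability of the true tokens, in nats."""
    T = len(true_token_probs)
    return -sum(math.log(p) for p in true_token_probs) / T

# Made-up probabilities the model assigned to the true token at each position.
p_true = [0.5, 0.25, 0.9, 0.1]
loss = cross_entropy(p_true)

# Minimizing the loss is maximizing the likelihood:
# exp(-T * loss) recovers the product of the per-token probabilities.
likelihood = math.prod(p_true)
print(loss, likelihood, math.exp(-len(p_true) * loss))
```

Because the likelihood is a decreasing function of the loss, lowering one always raises the other.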

Why the logarithm? Because it makes the penalty grow without bound as the model’s probability for the truth approaches zero. A model that says “1% chance” for the token that actually appears pays $-\log(0.01) \approx 4.6$; one that says “0.001%” pays $\approx 11.5$. Cross-entropy punishes confident mistakes severely and honest uncertainty only mildly, which is exactly the incentive you want.
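The two figures quoted above are easy to check, and a short sweep shows how quickly the penalty grows. The helper name `penalty` is ours, and the probabilities are arbitrary sample points.

```python
import math

def penalty(p):
    """Loss in nats paid at one position when the true token got probability p."""
    return -math.log(p)

# From a near-certain correct prediction down to a 0.001% chance for the truth.
for p in [0.9, 0.5, 0.1, 0.01, 0.00001]:
    print(f"p = {p:<8} loss = {penalty(p):.3f}")
```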

Cross-entropy loss, one token at a time

Interactive demo (not reproduced here): for the prompt "The cat sat on the ___", whose correct next token is "mat", the reader chooses which token the model predicts and how confident it is, from unsure (flat) to confident (peaked), and watches the loss it pays for the truth.

Example state, with the model betting confidently on the right token: mat 88.0% (correct), sofa 8.4%, floor 2.6%, table 0.8%, roof 0.2%. Cross-entropy loss = −ln(p of "mat") = 0.128; perplexity = e^loss = 1.14; bits-per-token = loss / ln 2 = 0.184.

The loss only looks at the probability the model gave the true token "mat". When the model bets confidently on the wrong token, it leaves "mat" almost no probability, so −ln(p) shoots toward infinity. Being confidently wrong costs far more than being unsure — that asymmetry is what teaches the model to be calibrated rather than reckless.
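The asymmetry can be reproduced with a rough stand-in for the demo. This sketch makes a simplifying assumption that is ours, not the demo's: the model puts a logit of `confidence` on the token it predicts and 0 on every other token. All names are illustrative.

```python
import math

VOCAB = ["mat", "sofa", "floor", "table", "roof"]
TRUE_TOKEN = "mat"

def peaked_distribution(predicted, confidence):
    """Hypothetical stand-in for the model: logit `confidence` on the predicted
    token, 0 on every other token, followed by a softmax."""
    logits = [confidence if tok == predicted else 0.0 for tok in VOCAB]
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(VOCAB, exps)}

def loss_for(predicted, confidence):
    """Cross-entropy in nats: only the probability of the true token matters."""
    return -math.log(peaked_distribution(predicted, confidence)[TRUE_TOKEN])

for confidence in [0.0, 2.0, 4.0, 8.0]:
    right = loss_for("mat", confidence)
    wrong = loss_for("sofa", confidence)
    print(f"confidence {confidence:>3}: right = {right:.3f}, wrong = {wrong:.3f}")
```

At confidence 0 both cases pay the same loss, because the distribution is uniform. As confidence rises, the loss for a correct bet shrinks toward zero while the loss for a wrong bet keeps growing, so the downside of overconfidence is much larger than the upside.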

Perplexity and bits-per-token: the same loss in friendlier units

The raw loss is in nats (natural-log units), which are hard to interpret directly. Two rescalings of the same number are easier to read.

Perplexity is $e^{\mathcal{L}}$. You can read it as “the model is as confused as if it were choosing uniformly among this many tokens.” A perplexity of 1 is perfect; early GPT-2-scale models on web text were in the low tens; today’s best are lower still.
Bits-per-token is $\mathcal{L}/\ln 2$, the same loss expressed in bits (base-2 logarithm) rather than nats. This connects pre-training to compression: a better language model assigns shorter codes to real text. Training to minimize bits-per-token is, quite literally, learning to compress the corpus, and good compression demands understanding.
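Both conversions are one-liners. The sketch below applies them to the earlier example in which the true token "mat" received 88% probability, then checks the "choosing uniformly" reading of perplexity. The function names and the vocabulary size `V` are illustrative choices.

```python
import math

def perplexity(loss_nats):
    """Perplexity is e raised to the loss in nats."""
    return math.exp(loss_nats)

def bits_per_token(loss_nats):
    """Convert nats to bits by dividing by ln 2."""
    return loss_nats / math.log(2)

# The earlier example: the true token "mat" received probability 0.88.
loss = -math.log(0.88)
print(round(loss, 3), round(perplexity(loss), 2), round(bits_per_token(loss), 3))

# A uniform guess over V tokens has perplexity V (up to floating-point rounding).
V = 50_000  # illustrative vocabulary size
uniform_loss = -math.log(1 / V)
print(perplexity(uniform_loss))
```

For a single token, perplexity is just the reciprocal of the probability given to the truth, which is why 88% yields roughly 1.14.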

Teacher forcing: a trillion examples at once

There is one more efficiency trick baked into the objective. During training we don’t generate text and then grade it; we feed the model the true sequence and ask it, at every position simultaneously, to predict the next token. This is teacher forcing. Thanks to the causal mask (each position can only see earlier ones), a single forward pass over a length-$T$ sequence produces $T$ next-token predictions and therefore $T$ loss terms, all in parallel.
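The data layout behind teacher forcing can be sketched without a model at all. The example assumes a start marker `<bos>` is prepended, so that the first real token also has a context; that marker and the whitespace "tokenization" are assumptions for illustration. The loop only spells out what each position sees. A real implementation would compute every position in one batched forward pass.

```python
# "<bos>" is an assumed start-of-sequence marker; tokens are whole words for readability.
tokens = ["<bos>", "The", "cat", "sat", "on", "the", "mat"]

# Teacher forcing: inputs are the true tokens, targets are the same tokens shifted by one.
inputs = tokens[:-1]
targets = tokens[1:]

# Causal mask: row i marks which input positions position i may look at (itself and earlier).
n = len(inputs)
causal_mask = [[j <= i for j in range(n)] for i in range(n)]

# Each position yields one (visible context -> target) training example.
examples = []
for i in range(n):
    visible = [tok for tok, allowed in zip(inputs, causal_mask[i]) if allowed]
    examples.append((visible, targets[i]))

for visible, target in examples:
    print(" ".join(visible), "->", target)
```

Six real tokens produce six training examples from a single sequence, which matches the one-loss-term-per-token count described above.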

This is why a document is worth its length in training signal, and why the context length we train at (say 4k or 8k tokens) directly sets how much supervision each sequence yields. We will see context length become a first-class design knob in modern models.

That is the entire objective. Simple to state, and — given enough scale — astonishingly powerful. The next question is mechanical: given this loss, how do we actually change billions of parameters to reduce it? That is gradient descent and backpropagation.