Section 17

Llama 3

15 trillion tokens and a data engine

Paper: The Llama 3 Herd of Models — Grattafiori et al., 2024

We now leave history and enter the present. From here on, every chapter is a recent production model, and — following our plan — we’ll explain each shared technique once and then spotlight only what each new paper actually changes. Meta’s Llama 3 (Grattafiori et al., 2024) is the ideal starting point: a clean, dense, openly-documented model whose report reads like a checklist of modern pre-training practice. Its big themes are data at industrial scale and training small models far past compute-optimal.

The data engine: 15 trillion tokens, deliberately mixed

Llama 3’s flagship is a 405-billion-parameter dense transformer trained on 15.6 trillion tokens — roughly 50× GPT-3’s data. But the headline number isn’t the point; the curation is. Two ideas from our foundations chapters appear here in their mature form:

Scaling-law-driven data mix. Rather than guess the data mixture, Meta trained many small models on candidate mixes and used scaling laws to predict which mix would be best at 405B scale. The winning recipe: roughly 50% general knowledge, 25% math and reasoning, 17% code, 8% multilingual. Data composition became a quantitative optimization, not a hunch.
Model-based quality filtering. Beyond heuristics, Llama 3 used trained classifiers (including knowledge classifiers) to score and down-sample low-value web data, plus a custom HTML parser tuned to extract clean text. The data funnel from chapter 8, industrialized.
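To make the mix concrete, here is a minimal sketch of drawing training documents in proportion to the percentages quoted above. The source labels, function names, and the per-document sampling scheme are assumptions for illustration; the report does not describe its sampler at this level, and this is not Meta's pipeline.

```python
import random
from collections import Counter

# Approximate mix quoted in the text; the keys are hypothetical labels.
DATA_MIX = {
    "general_knowledge": 0.50,
    "math_and_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}


def sample_sources(mix, n_draws, seed=0):
    """Pick a source for each of n_draws documents, in proportion to the mix."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n_draws)


def empirical_mix(draws):
    """Fraction of draws that landed in each source."""
    counts = Counter(draws)
    return {name: counts[name] / len(draws) for name in counts}


if __name__ == "__main__":
    draws = sample_sources(DATA_MIX, n_draws=20_000)
    for name, frac in sorted(empirical_mix(draws).items()):
        print(f"{name:20s} target={DATA_MIX[name]:.2f} sampled={frac:.3f}")
```

Changing the mix then means changing four numbers, which is what makes it something you can search over with small proxy runs.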

The architecture is deliberately boring

Llama 3 makes a point of not innovating on architecture: a standard dense, pre-norm transformer so that scale and data can be the variables. The few choices worth naming are the modern defaults we’ll reuse throughout:

Grouped-Query Attention (GQA) with 8 key-value heads — many query heads share a smaller set of key/value heads, shrinking the KV cache (we’ll quantify this next chapter) with little quality loss.
Rotary Position Embedding (RoPE) for positions, with the base frequency raised to 500,000 to support long context.
A 128K-token vocabulary (built on tiktoken plus extra non-English tokens) and a context window extended in a final stage to 128K tokens.
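The head sharing in GQA is easy to see in code. The sketch below is a toy, pure-Python causal attention in which each group of query heads reads from one shared key/value head. The tensor sizes, the nested-list layout, and the function names are assumptions for illustration; it omits RoPE, projections, and batching, and is not how a production kernel is written.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_attention(q, k, v):
    """Causal grouped-query attention.

    q:    [n_q_heads][seq][d]   one set of queries per query head
    k, v: [n_kv_heads][seq][d]  fewer heads, shared across query heads
    Returns a [n_q_heads][seq][d] nested list.
    """
    n_q_heads, n_kv_heads = len(q), len(k)
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group = n_q_heads // n_kv_heads  # query heads per shared KV head
    d = len(q[0][0])
    out = []
    for h in range(n_q_heads):
        kv = h // group  # index of the KV head this query head shares
        head_out = []
        for t, q_t in enumerate(q[h]):
            # Causal mask: position t only attends to positions 0..t.
            scores = [
                sum(a * b for a, b in zip(q_t, k[kv][s])) / math.sqrt(d)
                for s in range(t + 1)
            ]
            w = softmax(scores)
            head_out.append(
                [sum(w[s] * v[kv][s][j] for s in range(t + 1)) for j in range(d)]
            )
        out.append(head_out)
    return out


if __name__ == "__main__":
    # Toy sizes (hypothetical): 4 query heads sharing 2 KV heads, 2 tokens, d=2.
    q = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
    k = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]]
    v = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    for h, head in enumerate(gqa_attention(q, k, v)):
        print(h, head)
```

Only `k` and `v` need to be cached during generation, so storing 8 KV heads instead of one per query head is where the memory saving comes from.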

The compute-optimal twist: over-train the small ones

Here’s Llama 3’s most instructive pre-training decision. The 405B flagship is roughly compute-optimal for its training budget, according to Meta’s own scaling-law fit. But the 8B and 70B models are trained on the same ~15T tokens, far past the ~20-tokens-per-parameter Chinchilla rule of thumb (8B on 15T is nearly 1,900 tokens per parameter; 70B is over 200). Why “waste” the compute?

Because Chinchilla optimizes training cost alone, and a deployed model also costs money to run. As we flagged last chapter, it’s often worth over-training a smaller model: spend extra training compute now to get a model that is permanently cheaper at inference.
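The tokens-per-parameter figures are simple arithmetic, sketched below. The nominal parameter counts (8e9, 70e9, 405e9), the use of the ~20-tokens-per-parameter rule as a reference point, and the function names are assumptions for illustration.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb quoted in the text

# (nominal parameter count, training tokens), as quoted in this section.
MODELS = {
    "8B": (8e9, 15e12),
    "70B": (70e9, 15e12),
    "405B": (405e9, 15.6e12),
}


def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def overtraining_factor(n_params, n_tokens):
    """How many times past the ~20 tokens/parameter rule of thumb a run goes."""
    return tokens_per_param(n_params, n_tokens) / CHINCHILLA_TOKENS_PER_PARAM


if __name__ == "__main__":
    for name, (n_params, n_tokens) in MODELS.items():
        ratio = tokens_per_param(n_params, n_tokens)
        factor = overtraining_factor(n_params, n_tokens)
        print(f"{name:>5}: {ratio:7.1f} tokens/param, {factor:5.1f}x the rule of thumb")
```

On these nominal sizes the flagship stays within a small factor of the rule of thumb, while the 8B model sits roughly two orders of magnitude past it.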

Train once, serve forever: why over-train?

Two models of similar quality. The big one is compute-optimal; the small one is over-trained. Total cost depends on how much you'll serve.

Lifetime tokens served = 1.0e+13 → cheaper overall: over-trained 8B

Compute-optimal 70B: train 5.9e+23 FLOPs · serve 2N FLOPs/token · total 1.99e+24 FLOPs

Over-trained 8B: train 7.2e+23 FLOPs · serve 2N FLOPs/token · total 8.80e+23 FLOPs

The over-trained 8B costs more to train (15T tokens is far past compute-optimal) but only ~2×8B FLOPs per token to run, versus ~2×70B for the big model. Past the crossover (~1e+12 tokens), the smaller model wins on total cost — and most deployed models serve far more than that. That is exactly why Llama 3's smaller models are trained on 15T tokens. (Quality assumed equal for illustration.)

The widget makes the logic concrete. The over-trained small model costs more to train but far less per token to serve; past a crossover in lifetime volume — which any widely-deployed model blows past — it wins on total cost. Llama 3’s smaller models are the embodiment of this trade.
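The numbers in the comparison can be reproduced with a few lines of arithmetic. This sketch assumes the common approximations of 6·N·D FLOPs for training and 2·N FLOPs per served token, and a 70B baseline trained at 20 tokens per parameter; those assumptions match the figures quoted above, but the function and variable names are hypothetical, and quality is assumed equal, as in the illustration.

```python
def train_flops(n_params, n_train_tokens):
    """Approximate training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_train_tokens


def serve_flops(n_params, tokens_served):
    """Approximate inference cost: about 2 FLOPs per parameter per token."""
    return 2 * n_params * tokens_served


def total_flops(n_params, n_train_tokens, tokens_served):
    """Lifetime cost: train once, then serve tokens_served tokens."""
    return train_flops(n_params, n_train_tokens) + serve_flops(n_params, tokens_served)


def crossover_tokens(big, small):
    """Served-token volume at which the small model's total cost matches the big one's.

    big, small: (n_params, n_train_tokens) tuples.
    """
    extra_train = train_flops(*small) - train_flops(*big)
    saved_per_token = 2 * (big[0] - small[0])
    return extra_train / saved_per_token


# Assumed setups: 70B at 20 tokens/parameter, 8B on 15T tokens.
BIG_70B = (70e9, 20 * 70e9)
SMALL_8B = (8e9, 15e12)

if __name__ == "__main__":
    served = 1e13
    print(f"70B total: {total_flops(*BIG_70B, served):.2e}")
    print(f"8B total:  {total_flops(*SMALL_8B, served):.2e}")
    print(f"crossover: {crossover_tokens(BIG_70B, SMALL_8B):.2e} tokens")
```

Below the crossover volume the compute-optimal 70B is cheaper overall; above it, the extra training FLOPs spent on the 8B are repaid by every token served.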

And the systems, briefly

The 405B run used up to 16,000 H100 GPUs and BF16 precision, scaled with 4D parallelism (tensor, pipeline, context, and data parallelism — context parallelism added to handle the 128K sequences). This is the parallelism chapter at full production scale: nothing exotic, just every axis turned on at once and tuned for utilization.
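One way to picture 4D parallelism is as a grid: every GPU gets a coordinate along each of the four axes, and the product of the axis sizes is the GPU count. The sketch below maps a flat rank to such coordinates. The specific degrees, the axis ordering (tensor innermost, data outermost), and all names are assumptions chosen for illustration so that the product is 16,384; they are not taken from the text above.

```python
from typing import NamedTuple


class ParallelDegrees(NamedTuple):
    """Size of each parallelism axis (illustrative values only)."""
    tp: int  # tensor parallel
    cp: int  # context parallel
    pp: int  # pipeline parallel
    dp: int  # data parallel

    @property
    def world_size(self):
        """Total number of GPUs implied by the four degrees."""
        return self.tp * self.cp * self.pp * self.dp


def rank_to_coords(rank, deg):
    """Map a flat GPU rank to (tp, cp, pp, dp) coordinates.

    Assumes tp varies fastest and dp slowest, so GPUs that exchange data
    most often (tensor parallel peers) get adjacent ranks.
    """
    if not 0 <= rank < deg.world_size:
        raise ValueError(f"rank {rank} outside world of size {deg.world_size}")
    rest, tp = divmod(rank, deg.tp)
    rest, cp = divmod(rest, deg.cp)
    dp, pp = divmod(rest, deg.pp)
    return {"tp": tp, "cp": cp, "pp": pp, "dp": dp}


if __name__ == "__main__":
    # Hypothetical long-context layout: 8 * 16 * 16 * 8 = 16,384 GPUs.
    deg = ParallelDegrees(tp=8, cp=16, pp=16, dp=8)
    print("world size:", deg.world_size)
    for rank in (0, 7, 8, 16383):
        print(rank, rank_to_coords(rank, deg))
```

Raising the context-parallel degree for long sequences has to be paid for on another axis, here data parallelism, because the product is fixed by the number of GPUs available.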

Llama 3 is the modern dense baseline. The next model keeps the data discipline but rebuilds the architecture from the ground up for efficiency — and it’s the most technically dense report in this explainer.