Section 17

Llama 3

15 trillion tokens and a data engine

Paper: The Llama 3 Herd of Models — Grattafiori et al., 2024

We now leave history and enter the present. From here on, every chapter covers a recent production model and, following our plan, we’ll explain each shared technique once and then spotlight only what each new paper actually changes. Meta’s Llama 3 (Grattafiori et al., 2024) is the ideal starting point: a clean, dense, openly documented model whose report reads like a checklist of modern pre-training practice. Its big themes are data at industrial scale and training small models far past compute-optimal.

The data engine: 15 trillion tokens, deliberately mixed

Llama 3’s flagship is a 405-billion-parameter dense transformer trained on 15.6 trillion tokens — roughly 50× GPT-3’s data. But the headline number isn’t the point; the curation is. Two ideas from our foundations chapters appear here in their mature form:

  • Scaling-law-driven data mix. Rather than guess the data mixture, Meta trained many small models on candidate mixes and used scaling laws to predict which mix would be best at 405B scale. The winning recipe: roughly 50% general knowledge, 25% math and reasoning, 17% code, 8% multilingual. Data composition became a quantitative optimization, not a hunch.
  • Model-based quality filtering. Beyond heuristics, Llama 3 used trained classifiers (including knowledge classifiers) to score and down-sample low-value web data, plus a custom HTML parser tuned to extract clean text. The data funnel from chapter 8, industrialized.
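The extrapolation step in the first bullet can be sketched in a few lines. This is a minimal illustration, not Meta's pipeline: the candidate mixes (`mix_a`, `mix_b`), their power-law parameters, and the small-run compute budgets are all synthetic values invented for the example, and a pure power law with no irreducible-loss term is a simplifying assumption. Only the target budget comes from the text, via the common 6·N·D approximation for training FLOPs.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# SYNTHETIC stand-ins for "many small models": hypothetical (a, b) per mix.
TRUE_PARAMS = {"mix_a": (30.0, 0.050), "mix_b": (50.0, 0.060)}
SMALL_BUDGETS = [1e18, 1e19, 1e20, 1e21]  # FLOPs of the small runs (assumed)

# Target budget: roughly 6 * N * D for 405B parameters on 15.6T tokens.
TARGET_FLOPS = 6 * 405e9 * 15.6e12

small_scale_loss = {}   # loss "measured" at the largest small budget
predicted_loss = {}     # loss extrapolated to the target budget
for name, (a, b) in TRUE_PARAMS.items():
    losses = [a * c ** (-b) for c in SMALL_BUDGETS]  # synthetic measurements
    a_fit, b_fit = fit_power_law(SMALL_BUDGETS, losses)
    small_scale_loss[name] = losses[-1]
    predicted_loss[name] = a_fit * TARGET_FLOPS ** (-b_fit)

best_small = min(small_scale_loss, key=small_scale_loss.get)
best_at_target = min(predicted_loss, key=predicted_loss.get)
print("best at small scale:", best_small)
print("best at target scale:", best_at_target)
```

The synthetic parameters are chosen so that the ranking flips between the small runs and the target budget, which is the reason to extrapolate with a fitted law rather than simply pick the mix that wins at small scale.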

The architecture is deliberately boring

Llama 3 makes a point of not innovating on architecture: a standard dense, pre-norm transformer so that scale and data can be the variables. The few choices worth naming are the modern defaults we’ll reuse throughout:

  • Grouped-Query Attention (GQA) with 8 key-value heads: many query heads share a smaller set of key/value heads, shrinking the KV cache (we’ll quantify this next chapter) with little quality loss.
  • Rotary Position Embedding (RoPE) for positions, with the base frequency increased to support long context.
  • A 128K-token vocabulary (built on tiktoken plus extra non-English tokens) and a context window extended in a final stage to 128K tokens.
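To make the RoPE bullet concrete, here is a toy sketch in plain Python. It assumes the interleaved pairing convention (dimensions 0 and 1 form a pair, then 2 and 3, and so on); real implementations differ in how they pair dimensions. The vectors, the head dimension of 128, and the raised base of 500,000 are illustrative assumptions, since the text above only says the base frequency was increased.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each (even, odd) pair of `vec` by position * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)  # earlier pairs rotate faster
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def longest_wavelength(head_dim, base):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    slowest = base ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest

q = [0.3, -1.2, 0.7, 0.5]  # toy query vector (hypothetical values)
k = [1.0, 0.4, -0.6, 0.9]  # toy key vector (hypothetical values)

# Same offset (3 positions apart) at two different absolute positions.
near = dot(rope_rotate(q, 10), rope_rotate(k, 7))
far = dot(rope_rotate(q, 103), rope_rotate(k, 100))
print(near, far)  # equal up to rounding: the score depends only on the offset

# A larger base stretches the slowest rotation over many more positions.
print(longest_wavelength(128, 10_000.0), longest_wavelength(128, 500_000.0))
```

The first print shows why RoPE encodes relative position: rotating the query and key by their own positions leaves an attention score that depends only on how far apart they are. The second shows the effect of raising the base: the slowest-rotating pair takes far more positions to wrap around, which is the intuition behind increasing it for long context.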

The compute-optimal twist: over-train the small ones

Here’s Llama 3’s most instructive pre-training decision. The 405B flagship is roughly compute-optimal in the Chinchilla sense. But the 8B and 70B models are trained on the same ~15T tokens, wildly past the ~20-tokens-per-parameter rule (8B on 15T is nearly 1,900 tokens per parameter). Why “waste” the compute?

Because Chinchilla optimizes training cost alone, and a deployed model also costs compute every time it runs. As we flagged last chapter, it’s often worth over-training a smaller model: spend extra training compute now to get a model that is permanently cheaper at inference.
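To see how far past the rule of thumb each model sits, the tokens-per-parameter arithmetic can be checked directly. This is a small sketch using the rounded token counts quoted above (~15T for the smaller models, 15.6T for the flagship); the dictionary labels are just names for this example.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb cited in the text

# (parameters, training tokens), using the rounded figures from the text.
MODELS = {
    "Llama 3 8B": (8e9, 15e12),
    "Llama 3 70B": (70e9, 15e12),
    "Llama 3 405B": (405e9, 15.6e12),
}

def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params

ratios = {}
for name, (params, tokens) in MODELS.items():
    ratios[name] = tokens_per_param(params, tokens)
    multiple = ratios[name] / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratios[name]:,.0f} tokens/param "
          f"({multiple:.1f}x the 20-per-parameter rule)")
```

The 8B model comes out at 1,875 tokens per parameter and the 70B at a little over 200, while the 405B flagship stays within a small factor of the rule of thumb.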

[Interactive figure: Train once, serve forever: why over-train?]
Two models of similar quality. The big one is compute-optimal; the small one is over-trained. Total cost depends on how much you'll serve.
Axes: lifetime tokens served (log scale, 1e11 to 1e15) against total FLOPs (train + serve).
  • Compute-optimal 70B: train 5.9e+23, serve 2N FLOPs per token, total 1.99e+24
  • Over-trained 8B: train 7.2e+23, serve 2N FLOPs per token, total 8.80e+23
The over-trained 8B costs more to train (15T tokens is far past compute-optimal) but only ~2×8B FLOPs per token to run, versus ~2×70B for the big model. Past the crossover (~1e+12 tokens), the smaller model wins on total cost, and most deployed models serve far more than that. That is exactly why Llama 3's smaller models are trained on 15T tokens. (Quality assumed equal for illustration.)

The widget makes the logic concrete. The over-trained small model costs more to train but far less per token to serve; past a crossover in lifetime volume — which any widely-deployed model blows past — it wins on total cost. Llama 3’s smaller models are the embodiment of this trade.
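The same trade can be reproduced with a few lines of arithmetic. This sketch assumes the usual approximations of about 6·N·D FLOPs to train a model with N parameters on D tokens and about 2·N FLOPs per token to serve it, and it assumes equal quality for the two models, as the illustration does. The function and variable names are hypothetical.

```python
def train_flops(params, tokens):
    """Approximate training compute with the common 6 * N * D rule."""
    return 6 * params * tokens

def serve_flops(params, tokens_served):
    """Approximate inference compute: about 2 * N FLOPs per token."""
    return 2 * params * tokens_served

def total_flops(model, tokens_served):
    """Lifetime compute: one training run plus all tokens served."""
    return (train_flops(model["params"], model["train_tokens"])
            + serve_flops(model["params"], tokens_served))

def crossover_tokens(big, small):
    """Lifetime tokens served at which both models have equal total FLOPs."""
    extra_training = (train_flops(small["params"], small["train_tokens"])
                      - train_flops(big["params"], big["train_tokens"]))
    saving_per_token = 2 * (big["params"] - small["params"])
    return extra_training / saving_per_token

BIG = {"params": 70e9, "train_tokens": 20 * 70e9}  # compute-optimal: ~20 tokens/param
SMALL = {"params": 8e9, "train_tokens": 15e12}     # over-trained on ~15T tokens

print(f"crossover at {crossover_tokens(BIG, SMALL):.2e} tokens served")
for served in (1e11, 1e12, 1e13, 1e14):
    print(f"{served:.0e} served: 70B {total_flops(BIG, served):.2e} FLOPs, "
          f"8B {total_flops(SMALL, served):.2e} FLOPs")
```

Under these assumptions the break-even point lands at roughly a trillion served tokens. Below it, the compute-optimal 70B is cheaper overall; above it, the over-trained 8B pulls ahead, and the gap keeps widening because every additional token costs it far less.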

Llama 3 is the modern dense baseline. The next model keeps the data discipline but rebuilds the architecture from the ground up for efficiency — and it’s the most technically dense report in this explainer.