Section 06

Compute & memory

FLOPs, the 6ND rule, and where the GBs go

Pre-training is bounded by two finite resources: compute (how many arithmetic operations you can do) and memory (how many numbers you can hold at once). Almost every architectural and systems decision in this explainer is, underneath, a negotiation with one of these two limits. This chapter makes both concrete.

Compute: the 6ND rule

Training cost is measured in FLOPs FLOP Floating-Point Operation — one multiply or add. Training compute is measured in total FLOPs; a frontier run is on the order of 10^24–10^25 FLOPs. See in glossary → — floating-point operations. There is a beautifully simple estimate for how many a run takes. For a dense model with NN parameters trained on DD tokens:

C6NDfloating-point operationsC \approx 6ND \quad\text{floating-point operations}

The factor of 6 breaks down as roughly 2ND2ND for the forward pass forward pass Running inputs through the network to produce outputs (logits) and the loss, caching intermediate activations that backpropagation will need. See in glossary → (each parameter does one multiply and one add per token) and about 4ND4ND for the backward pass backward pass The second half of a training step: backpropagation walks from the loss back through the network, computing each parameter's gradient. See in glossary → (computing gradients with respect to both activations and weights). This little formula, the 6ND rule 6ND rule A rule of thumb: training a dense model with N parameters on D tokens costs about 6ND floating-point operations (≈2ND forward + ≈4ND backward). See in glossary → , is the master equation of training economics — it’s what scaling laws optimize, and what cloud bills are made of.

A GPU never hits its peak FLOP rating in practice. The fraction it does achieve is Model FLOPs Utilization MFU Model FLOPs Utilization — the fraction of a GPU's peak floating-point throughput actually used for useful model math. Real large-scale runs often land around 30–50%. See in glossary → — real large-scale runs are typically in the 30–50% range. The gap is lost to communication, memory stalls, and pipeline gaps. Much of the systems engineering in the modern papers is a fight to push MFU up a few points, because at this scale a few points is millions of dollars.

Memory: where the gigabytes actually go

Compute sets how long a run takes; memory sets whether it fits at all. And here intuition misleads people: the weights are often the smallest part of the budget. For training with AdamW AdamW Adam with decoupled Weight decay — the de facto standard LLM optimizer. It applies weight decay directly to the parameters instead of folding it into the gradient, which regularizes more cleanly. See in glossary → in mixed precision mixed-precision training Doing the heavy matrix multiplies in a low-precision format (BF16/FP8) for speed while keeping a high-precision (FP32) copy of the weights and accumulating sensitive sums in FP32 for stability. See in glossary → , each parameter drags along a whole entourage:

  • a BF16 working copy of the weight (2 bytes),
  • its BF16 gradient (2 bytes),
  • an FP32 master copy of the weight (4 bytes),
  • and Adam’s two optimizer states optimizer states Extra per-parameter values an optimizer maintains — for Adam, the first and second moment estimates. In FP32 these add 8 bytes per parameter, often dwarfing the weights themselves. See in glossary → — the first and second moments, in FP32 (4 + 4 bytes).

That’s 16 bytes per parameter before you’ve stored a single activation. A 70B model is therefore over a terabyte of state — far more than any single GPU’s memory.

On top of that sit the activations activations The intermediate tensors produced during the forward pass. They must be kept around for the backward pass, and at long context they can dominate memory use. See in glossary → — the intermediate tensors the forward pass produces and the backward pass consumes. Unlike the per-parameter costs, activation memory scales with batch size and sequence length, so at long context context length The maximum number of tokens the model can attend to at once (also called the context window or sequence length). Pre-training picks a context length; later stages often extend it. See in glossary → it can dominate everything else.

Training memory budget
Per-GPU memory to train with AdamW in mixed precision. Watch the optimizer states dwarf the weights — and the activations explode with sequence length.
Weights (BF16)13.0 GB
Gradients (BF16)13.0 GB
Master weights (FP32)26.1 GB
Adam moment m (FP32)26.1 GB
Adam moment v (FP32)26.1 GB
Activations8.0 GB
Total per GPU
112.3 GB
vs one 80GB H100
needs sharding — ~2× GPUs
For a 70B or 405B model, weights are a small slice — the FP32 master copy and Adam's two moments are the giants, which is why ZeRO/FSDP shard them first. Crank the sequence length with checkpointing off and activations take over instead. No single GPU holds a large model; the next chapter is about splitting it across many. (Estimates, not a spec.)

Two levers in that widget are worth dwelling on. First, optimizer states are the giant for large models — which is exactly why the first thing you do when scaling out is shard them ( ZeRO ZeRO Zero Redundancy Optimizer — a family of techniques that shard optimizer states, gradients, and optionally parameters across data-parallel GPUs so no device holds a full redundant copy. See in glossary → / FSDP FSDP Fully Sharded Data Parallel — PyTorch's implementation of ZeRO-style sharding: each GPU stores a shard of the parameters and gathers the rest just in time for each layer's compute. See in glossary → , next chapter). Second, gradient checkpointing gradient checkpointing Activation recomputation — saving memory by discarding most activations in the forward pass and recomputing them during the backward pass, trading extra compute for far less memory. See in glossary → — recomputing activations during the backward pass instead of storing them — trades a chunk of extra compute for a massive reduction in activation memory, and it’s standard practice at long context.