Section 06

Compute & memory

FLOPs, the 6ND rule, and where the GBs go

Pre-training is bounded by two finite resources: compute (how many arithmetic operations you can do) and memory (how many numbers you can hold at once). Almost every architectural and systems decision in this explainer is, underneath, a negotiation with one of these two limits. This chapter makes both concrete.

Compute: the 6ND rule

Training cost is measured in FLOPs — floating-point operations. There is a beautifully simple estimate for how many a run takes. For a dense model with $N$ parameters trained on $D$ tokens:

C \approx 6ND \quad\text{floating-point operations}

The factor of 6 breaks down as roughly $2ND$ for the forward pass (each parameter does one multiply and one add per token) and about $4ND$ for the backward pass (computing gradients with respect to both activations and weights). This little formula, the 6ND rule , is the master equation of training economics — it’s what scaling laws optimize, and what cloud bills are made of.

A GPU never hits its peak FLOP rating in practice. The fraction it does achieve is Model FLOPs Utilization — real large-scale runs are typically in the 30–50% range. The gap is lost to communication, memory stalls, and pipeline gaps. Much of the systems engineering in the modern papers is a fight to push MFU up a few points, because at this scale a few points is millions of dollars.

Compute-bound vs memory-bound

Not every operation is limited by math throughput. Many are limited by memory bandwidth — how fast data moves between the GPU’s compute units and its memory. The ratio of compute to memory traffic is an operation’s arithmetic intensity ; low-intensity ops (like the elementwise work in normalization, or attention at short context) stall waiting on memory rather than maxing out the math units. This distinction, central to the inference side, also shapes which training kernels get fused and optimized.

Memory: where the gigabytes actually go

Compute sets how long a run takes; memory sets whether it fits at all. And here intuition misleads people: the weights are often the smallest part of the budget. For training with AdamW in mixed precision , each parameter drags along a whole entourage:

a BF16 working copy of the weight (2 bytes),
its BF16 gradient (2 bytes),
an FP32 master copy of the weight (4 bytes),
and Adam’s two optimizer states — the first and second moments, in FP32 (4 + 4 bytes).

That’s 16 bytes per parameter before you’ve stored a single activation. A 70B model is therefore over a terabyte of state — far more than any single GPU’s memory.

On top of that sit the activations — the intermediate tensors the forward pass produces and the backward pass consumes. Unlike the per-parameter costs, activation memory scales with batch size and sequence length, so at long context it can dominate everything else.

Training memory budget

Per-GPU memory to train with AdamW in mixed precision. Watch the optimizer states dwarf the weights — and the activations explode with sequence length.

Weights (BF16)13.0 GB

Gradients (BF16)13.0 GB

Master weights (FP32)26.1 GB

Adam moment m (FP32)26.1 GB

Adam moment v (FP32)26.1 GB

Activations8.0 GB

Sequences per GPU = 4Sequence length = 4,096

Gradient checkpointing (recompute activations in the backward pass)

Total per GPU

112.3 GB

vs one 80GB H100

needs sharding — ~2× GPUs

For a 70B or 405B model, weights are a small slice — the FP32 master copy and Adam's two moments are the giants, which is why ZeRO/FSDP shard them first. Crank the sequence length with checkpointing off and activations take over instead. No single GPU holds a large model; the next chapter is about splitting it across many. (Estimates, not a spec.)

Two levers in that widget are worth dwelling on. First, optimizer states are the giant for large models — which is exactly why the first thing you do when scaling out is shard them ( ZeRO / FSDP , next chapter). Second, gradient checkpointing — recomputing activations during the backward pass instead of storing them — trades a chunk of extra compute for a massive reduction in activation memory, and it’s standard practice at long context.