Section 18

DeepSeek-V3

MoE, MLA, MTP, and FP8 at scale

Paper: DeepSeek-V3 Technical Report — DeepSeek-AI, 2024

If Llama 3 is the modern dense baseline, DeepSeek-V3 (DeepSeek-AI, 2024) is the modern efficiency masterclass. It trained a model with 671 billion total parameters to frontier quality for a reported cost of under $6 million in GPU time for the final training run, a result that reset expectations for what a training run costs. Almost all of that saving came from rethinking architecture and numerics together. This is the densest chapter; take it slowly.

Mixture-of-Experts: huge capacity, small active compute

DeepSeek-V3 is a Mixture-of-Experts (MoE) model: each MoE layer contains many parallel sub-networks ("experts"), and a router sends each token to only a few of them. Of its 671B parameters, only 37B are active per token, so compute tracks the 37B, not the 671B. You get the knowledge capacity of a giant model at the FLOP cost of a medium one.
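As a back-of-the-envelope check on "compute tracks the 37B", the sketch below applies the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. The rule of thumb and the function name are assumptions for illustration, not figures from the report, and the attention cost over the context is ignored.

```python
# Rough forward-pass cost per token, assuming ~2 FLOPs per active parameter.
TOTAL_PARAMS = 671e9   # every expert counted
ACTIVE_PARAMS = 37e9   # parameters actually used for one token


def forward_flops_per_token(params: float) -> float:
    """Rule-of-thumb estimate: one multiply and one add per parameter."""
    return 2.0 * params


moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)
capacity_to_compute = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"MoE forward pass:        {moe_flops / 1e9:.0f} GFLOPs per token")
print(f"Dense model, same size:  {dense_flops / 1e9:.0f} GFLOPs per token")
print(f"Total / active params:   {capacity_to_compute:.1f}x")
```

Under this estimate the sparse model pays for a 37B-parameter forward pass while storing the knowledge of a 671B-parameter one.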

DeepSeek’s specific design, DeepSeekMoE, has two ingredients:

  • Fine-grained experts. Many small experts instead of a few big ones, giving the router more precise specialization.
  • Shared experts. A small number of experts (one per MoE layer in DeepSeek-V3) that every token always passes through; they absorb common knowledge so the routed experts are free to specialize.
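The two ingredients can be sketched in a few lines. This is a toy under stated assumptions, not DeepSeek's implementation: the experts are scalar functions, the router scores are given rather than learned, and the gate weights are simply the selected scores normalized to sum to one. All names (`top_k`, `moe_forward`, and so on) are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring routed experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def moe_forward(x, scores, shared_experts, routed_experts, k):
    """Shared experts always run; only the top-k routed experts run."""
    chosen = top_k(scores, k)
    gate_sum = sum(scores[i] for i in chosen)
    out = sum(expert(x) for expert in shared_experts)
    for i in chosen:
        # Assumption: gate weight = selected score normalized over the chosen set.
        out += (scores[i] / gate_sum) * routed_experts[i](x)
    return out, chosen


# One shared expert plus 16 small routed experts (toy: expert w multiplies by w).
shared_experts = [lambda x: x]
routed_experts = [lambda x, w=w: w * x for w in range(16)]

# Hypothetical router affinities for one token; experts 3 and 11 score highest.
scores = [0.5] * 16
scores[3] = 3.0
scores[11] = 1.0

out, chosen = moe_forward(2.0, scores, shared_experts, routed_experts, k=2)
print(chosen, out)  # [3, 11] 12.0
```

Only 3 of the 17 experts do any work for this token: the shared one and the two the router picked.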
[Interactive figure: Mixture-of-Experts routing. Each token goes to a shared expert (always on) plus its top-k routed experts, giving huge total capacity with small active compute. Bars show how many of the 12 tokens each of the 16 routed experts (0 to 15) received; uneven bars are the load-balancing problem. Readout for the setting shown: 17 total experts, 3 active per token, total ÷ active params 3.0×.]
The router activates only 3 of 17 experts per token, so the model holds far more knowledge than it spends compute on. DeepSeek-V3 takes this to an extreme: 671B total parameters, but only 37B active per token — roughly an 18× gap. The price is the routing machinery and keeping every expert evenly loaded.

Play with the router above and you’ll feel the core tension of MoE: as you add experts, total capacity grows but each expert sees fewer tokens, and the load across experts gets uneven. Under expert parallelism, where different experts live on different GPUs and tokens are routed across the network to reach them, an overloaded expert turns the GPU holding it into a bottleneck.
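To make the bottleneck concrete, the sketch below counts how many tokens land on each expert and on each GPU. The routing assignments are made up for illustration (12 tokens, top-2 routing, 16 routed experts, 4 experts per GPU), and the function names are hypothetical.

```python
def expert_load(assignments, num_experts):
    """Count how many token slots each expert receives."""
    counts = [0] * num_experts
    for chosen in assignments:
        for expert_id in chosen:
            counts[expert_id] += 1
    return counts


def imbalance(counts):
    """Busiest expert's load divided by the mean load (1.0 = perfectly even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


def gpu_load(counts, experts_per_gpu):
    """Sum expert loads per GPU, assuming consecutive experts share a device."""
    return [
        sum(counts[i:i + experts_per_gpu])
        for i in range(0, len(counts), experts_per_gpu)
    ]


# Made-up skewed routing: most tokens pick expert 0.
assignments = [(0, 1)] * 6 + [(0, 2)] * 3 + [(3, 4), (5, 6), (7, 8)]

counts = expert_load(assignments, num_experts=16)
per_gpu = gpu_load(counts, experts_per_gpu=4)

print(counts)             # [9, 6, 3, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(imbalance(counts))  # 6.0
print(per_gpu)            # [19, 4, 1, 0]
```

In this made-up batch the first GPU handles 19 of the 24 token slots while the last sits idle, so the whole layer waits on one device.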

Multi-head Latent Attention: a tiny KV cache

DeepSeek’s second architectural lever is Multi-head Latent Attention (MLA). Standard attention caches a key and value vector for every head at every position: the KV cache that dominates memory at long context. MLA instead compresses the keys and values into a single small low-rank latent vector per token, from which the per-head keys and values are reconstructed on the fly. The cache stores the latent, not all the heads.
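A minimal sketch of the caching idea, with toy dimensions and random weights standing in for learned ones. The matrix names (`W_down`, `W_up_k`, `W_up_v`) and sizes are assumptions for illustration, and the sketch leaves out parts of the real design such as how positional encoding is handled.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


D_MODEL, N_HEADS, HEAD_DIM, D_LATENT = 16, 4, 4, 4  # toy sizes

rng = random.Random(0)
W_down = random_matrix(D_LATENT, D_MODEL, rng)             # compress hidden state
W_up_k = random_matrix(N_HEADS * HEAD_DIM, D_LATENT, rng)  # latent -> all keys
W_up_v = random_matrix(N_HEADS * HEAD_DIM, D_LATENT, rng)  # latent -> all values

latent_cache = []  # the only thing stored per token


def cache_token(hidden):
    """Store one small latent vector instead of per-head keys and values."""
    latent_cache.append(matvec(W_down, hidden))


def split_heads(flat):
    return [flat[h * HEAD_DIM:(h + 1) * HEAD_DIM] for h in range(N_HEADS)]


def reconstruct(latent):
    """Rebuild per-head keys and values from the cached latent on the fly."""
    return split_heads(matvec(W_up_k, latent)), split_heads(matvec(W_up_v, latent))


for _ in range(3):
    cache_token([rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)])

keys, values = reconstruct(latent_cache[0])
mla_floats_per_token = D_LATENT
mha_floats_per_token = 2 * N_HEADS * HEAD_DIM
print(mla_floats_per_token, "cached values per token vs", mha_floats_per_token)
```

The trade is extra matrix multiplies at attention time in exchange for a cache that holds `D_LATENT` numbers per token instead of `2 * N_HEADS * HEAD_DIM`.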

KV-cache footprint: MHA vs GQA vs MLA
Same 80-layer, 64-head model. How much memory the keys and values take, by attention design.

  • MHA: 640.0 GB (keys + values for all 64 heads)
  • GQA: 80.0 GB (KV shared across 8 groups), 8× smaller than MHA
  • MLA: 22.5 GB (one low-rank latent per token), 28× smaller than MHA

The KV cache grows with sequence length, batch, and layers — at long context it can dwarf the model weights. GQA (used by Llama 3, Qwen, the Gemmas) cuts it by sharing keys/values across head groups. MLA (DeepSeek) goes further, caching one compressed latent per token. Smaller KV means longer context and bigger batches fit in memory — a training and inference win. (Illustrative 70B-class config, BF16.)

The comparison above covers the three attention designs we’ve now met. GQA (Llama, Qwen, Gemma) shrinks the KV cache by sharing key/value heads; MLA shrinks it much further by storing one compressed latent. A smaller KV cache means longer context and bigger batches fit in memory, which helps both training throughput and inference cost.
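The footprint arithmetic is simple enough to check by hand. The sketch below takes the 80-layer, 64-head BF16 configuration from the comparison and fills in three values the text does not state, so treat them as assumptions: a head dimension of 128, 262,144 cached tokens in total (sequence length times batch), and a latent width of 576 values per token per layer. With those assumed values the results line up with the 640.0, 80.0 and 22.5 figures quoted above.

```python
BYTES_BF16 = 2
GIB = 2 ** 30


def kv_cache_gib(values_per_token_per_layer, n_layers, n_tokens,
                 bytes_per_value=BYTES_BF16):
    """Cache size in GiB for a given number of cached values per token per layer."""
    total_bytes = values_per_token_per_layer * bytes_per_value * n_layers * n_tokens
    return total_bytes / GIB


N_LAYERS, N_HEADS = 80, 64   # from the comparison above
N_KV_GROUPS = 8              # from the comparison above
HEAD_DIM = 128               # assumption
LATENT_DIM = 576             # assumption: cached values per token per layer
N_TOKENS = 262_144           # assumption: sequence length x batch size

mha = kv_cache_gib(2 * N_HEADS * HEAD_DIM, N_LAYERS, N_TOKENS)      # K and V, every head
gqa = kv_cache_gib(2 * N_KV_GROUPS * HEAD_DIM, N_LAYERS, N_TOKENS)  # K and V, 8 KV heads
mla = kv_cache_gib(LATENT_DIM, N_LAYERS, N_TOKENS)                  # one latent, no K/V pair

print(f"MHA {mha:.1f}  GQA {gqa:.1f}  MLA {mla:.1f}")  # MHA 640.0  GQA 80.0  MLA 22.5
print(f"GQA shrinks MHA by {mha / gqa:.0f}x, MLA by {mha / mla:.1f}x")
```

Note that the cache scales linearly in every factor, so doubling the context or the batch doubles all three numbers while leaving the ratios unchanged.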

Multi-Token Prediction

The third change is to the objective itself. Alongside ordinary next-token prediction, DeepSeek-V3 trains with Multi-Token Prediction (MTP): at each position the model also predicts tokens further ahead (one additional token in DeepSeek-V3) using small extra prediction modules. This densifies the training signal (more to learn per position) and, as a bonus, the extra modules can be reused for speculative decoding to speed up inference.
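What "more to learn per position" means is easiest to see in the training targets. The sketch below builds them for a toy sentence and combines the losses as a weighted sum; the function names, the loss values, and the weight of 0.25 are all hypothetical, chosen only to show the shape of the objective.

```python
def mtp_targets(tokens, depth):
    """For each position, the next token plus `depth` further future tokens.

    Positions too close to the end to have a full set of targets are dropped.
    """
    n_future = 1 + depth
    return [
        tuple(tokens[i + 1:i + 1 + n_future])
        for i in range(len(tokens) - n_future)
    ]


def combined_loss(next_token_loss, mtp_losses, weight):
    """Main loss plus a weighted average of the extra-token losses."""
    return next_token_loss + weight * sum(mtp_losses) / len(mtp_losses)


tokens = ["the", "cat", "sat", "on", "the", "mat"]
targets = mtp_targets(tokens, depth=1)

for position, target in enumerate(targets):
    print(tokens[position], "->", target)
# the -> ('cat', 'sat')
# cat -> ('sat', 'on')
# sat -> ('on', 'the')
# on -> ('the', 'mat')

# Hypothetical per-batch losses: 2.0 for next-token, 3.0 for the extra token.
print(combined_loss(2.0, [3.0], weight=0.25))  # 2.75
```

With `depth=0` the targets reduce to ordinary next-token prediction, which is why MTP can be described as an addition to the standard objective rather than a replacement.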

FP8 training, validated at scale

Finally, the numerics. The DeepSeek-V3 report presents the run as the first validation of FP8 mixed-precision training on an extremely large-scale model. Recall from the precision chapter that FP8’s range is tiny; DeepSeek’s answer is fine-grained scaling (separate scaling factors for small tiles/blocks of each tensor) plus keeping the most sensitive accumulations in higher precision. On hardware with FP8 support, the payoff is up to roughly double the matrix-multiply throughput of BF16 and half the memory for tensors stored in FP8, which is a big part of how the run came in so cheap.
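Why per-tile scaling matters shows up even in a tiny example. The sketch below simulates rounding to an E4M3-like grid (3 mantissa bits, largest magnitude 448) in plain Python and compares one scale for the whole tensor against one scale per tile. The rounding helper is an approximation written for this illustration, not a real FP8 kernel, and the tile size of 4 is a toy value; the report describes much larger tiles and blocks.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def fake_e4m3(x):
    """Round to an E4M3-like grid: 3 mantissa bits, clamp at +-448,
    fixed step of 2**-9 below the smallest normal value 2**-6."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    exponent = max(math.frexp(a)[1] - 1, -6)  # floor(log2(a)), clamped
    step = 2.0 ** (exponent - 3)
    return sign * round(a / step) * step


def quantize_dequantize(values, tile_size):
    """Scale each tile so its largest value maps to 448, round, then undo the scale."""
    out = []
    for start in range(0, len(values), tile_size):
        tile = values[start:start + tile_size]
        amax = max(abs(v) for v in tile)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        out.extend(fake_e4m3(v / scale) * scale for v in tile)
    return out


# Four small values followed by four large ones (an "outlier" tile).
values = [0.0001, 0.0002, -0.00015, 0.0003, 200.0, -150.0, 90.0, 10.0]

per_tensor = quantize_dequantize(values, tile_size=len(values))  # one shared scale
per_tile = quantize_dequantize(values, tile_size=4)              # one scale per tile

print("per-tensor:", per_tensor[:4])
print("per-tile:  ", per_tile[:4])
```

With a single scale, the outliers set the range and the small values round to zero; with a scale per tile, the small values keep a few significant bits. The large tile is treated identically in both cases.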

DeepSeek-V3 is the template for efficient frontier pre-training: sparse compute (MoE), a tiny KV cache (MLA), a denser objective (MTP), and aggressive numerics (FP8). We’ll see its ideas echo through the 2026 frontier. But first, two more open models — Qwen and Gemma — each with a different emphasis.