Section 18

DeepSeek-V3

MoE, MLA, MTP, and FP8 at scale

Paper: DeepSeek-V3 Technical Report — DeepSeek-AI, 2024

If Llama 3 is the modern dense baseline, DeepSeek-V3 (DeepSeek-AI, 2024) is the modern efficiency masterclass. It trained a model with 671 billion total parameters to frontier quality in about 2.8 million H800 GPU-hours, which the paper prices at under $6 million of compute for the final training run (earlier research and ablation runs are not counted). That result reset expectations for what a training run costs. Almost all of the savings came from rethinking architecture and numerics together. This is the densest chapter; take it slowly.

Mixture-of-Experts: huge capacity, small active compute

DeepSeek-V3 is a Mixture-of-Experts model. Of its 671B parameters, only 37B are active per token — a router sends each token to a small subset of experts, so compute tracks the 37B, not the 671B. You get the knowledge capacity of a giant model at the FLOP cost of a medium one.

DeepSeek’s specific design, DeepSeekMoE, has two ingredients:

Fine-grained experts. Many small experts instead of a few big ones, giving the router more precise specialization.
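Shared experts. A small number of experts that every token always uses, which absorb common knowledge so the routed experts are free to specialize. DeepSeek-V3 uses one shared expert alongside 256 routed experts per MoE layer, with 8 routed experts active per token.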
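The two ingredients above can be sketched in a few lines. This is a toy illustration, not DeepSeek's implementation: the experts are stand-in scaling functions, the router is a random matrix, and the names `route_token` and `moe_layer` are hypothetical. It assumes sigmoid affinity scores that are normalized over the selected experts.

```python
import math
import random

def route_token(scores, top_k):
    """Pick the top_k highest-scoring routed experts for one token.

    Returns (expert_ids, gate_weights); the gate weights are the selected
    scores renormalized so they sum to 1.
    """
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
    total = sum(scores[e] for e in chosen)
    gates = [scores[e] / total for e in chosen]
    return chosen, gates

def moe_layer(x, shared_expert, routed_experts, router_weights, top_k):
    """Output = shared expert (always on) + gated sum of the top-k routed experts."""
    # Sigmoid affinity between the token and each routed expert (assumed form).
    scores = [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
              for row in router_weights]
    chosen, gates = route_token(scores, top_k)
    out = shared_expert(x)
    for e, g in zip(chosen, gates):
        y = routed_experts[e](x)  # only the chosen experts ever run
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, chosen

# Toy setup: 1 shared + 16 routed experts, top-2 routing.
random.seed(0)
dim, n_routed, top_k = 4, 16, 2
make_expert = lambda scale: (lambda x: [scale * xi for xi in x])
shared = make_expert(1.0)
routed = [make_expert(0.1 * (i + 1)) for i in range(n_routed)]
router_w = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_routed)]

token = [0.5, -1.0, 0.25, 2.0]
out, chosen = moe_layer(token, shared, routed, router_w, top_k)
print("routed to experts", chosen, "so", 1 + len(chosen), "of", 1 + n_routed, "experts ran")
```

Only the shared expert and the two selected routed experts are evaluated, so the per-token cost depends on top-k rather than on how many experts exist.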

Mixture-of-Experts routing

Each token goes to a shared expert plus its top-k routed experts. Huge total capacity, small active compute.

[Interactive figure: pick a token to trace its routing. One shared expert is always on and every token passes through it. Bars show how many of the 12 tokens each routed expert received (load); highlighted experts are the ones the selected token (here "cat") was routed to. Uneven bars are the load-balancing problem. Settings shown: 16 routed experts, top-k = 2. Readouts: total experts, active per token, and total ÷ active params (3.0× at these settings).]

The router activates only 3 of 17 experts per token, so the model holds far more knowledge than it spends compute on. DeepSeek-V3 takes this to an extreme: 671B total parameters, but only 37B active per token — roughly an 18× gap. The price is the routing machinery and keeping every expert evenly loaded.

Play with the router above and you’ll feel the core tension of MoE: as you add experts, total capacity grows but each expert sees fewer tokens, and the load across experts gets uneven. An overloaded expert (and the GPU holding it under expert parallelism) becomes a bottleneck.

Auxiliary-loss-free load balancing

The usual fix for uneven routing is a load-balancing auxiliary loss that pushes the router toward even usage, but that extra loss term fights the real objective and slightly degrades quality. DeepSeek-V3 pioneers an auxiliary-loss-free scheme: it adds a per-expert bias to the routing scores and nudges it down when an expert is overloaded and up when it is underloaded. The bias only affects which experts are selected, so no balancing loss of any real weight tugs on the model (the paper keeps just a very small sequence-wise balance loss as a safeguard). Better balance, no quality tax. It’s a small idea with an outsized practical effect.
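A minimal simulation of the bias-nudging idea, under stated assumptions: the skewed score distribution, the step size, and the helper names (`select_experts`, `update_bias`, `run`) are all made up for illustration, and the real router computes scores from token activations rather than random numbers.

```python
import random

def select_experts(scores, bias, top_k):
    """Pick top_k experts by (score + bias). The bias affects selection only."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:top_k]

def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    for e, n in enumerate(load):
        if n > mean_load:
            bias[e] -= step_size
        elif n < mean_load:
            bias[e] += step_size

def run(num_steps, n_experts=8, top_k=2, tokens_per_step=1024, step_size=0.01, seed=0):
    """Return the per-expert load observed in the last simulated step."""
    rng = random.Random(seed)
    # Hypothetical skew: low-index experts get systematically higher scores.
    skew = [0.3 * (n_experts - e) / n_experts for e in range(n_experts)]
    bias = [0.0] * n_experts
    load = [0] * n_experts
    for _ in range(num_steps):
        load = [0] * n_experts
        for _ in range(tokens_per_step):
            scores = [rng.random() + skew[e] for e in range(n_experts)]
            for e in select_experts(scores, bias, top_k):
                load[e] += 1
        update_bias(bias, load, step_size)  # no gradient, no loss term
    return load

before = run(1)    # load with all biases still at zero
after = run(100)   # load after 99 bias updates
print("spread before:", max(before) - min(before))
print("spread after: ", max(after) - min(after))
```

The update is a plain rule applied between steps, outside backpropagation, which is why it does not compete with the language-modeling loss.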

Multi-head Latent Attention: a tiny KV cache

DeepSeek’s second architectural lever is Multi-head Latent Attention (MLA). Standard attention caches a key and value vector for every head at every position — the KV cache that dominates memory at long context. MLA instead compresses the keys and values into a single small low-rank latent vector per token, from which the per-head keys and values are reconstructed on the fly. The cache stores the latent, not all the heads.
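In symbols, the compression can be written roughly as follows. The notation follows the DeepSeek papers and is an assumption as far as this chapter goes: $\mathbf{h}_t$ is the hidden state at position $t$, $d_c$ the latent width, $n_h$ the number of heads, and $d_h$ the per-head dimension.

```latex
\begin{aligned}
\mathbf{c}_t^{KV} &= W^{DKV}\,\mathbf{h}_t,
  & \mathbf{c}_t^{KV} &\in \mathbb{R}^{d_c}, \quad d_c \ll n_h d_h \\
\mathbf{k}_t &= W^{UK}\,\mathbf{c}_t^{KV},
  & \mathbf{v}_t &= W^{UV}\,\mathbf{c}_t^{KV}
\end{aligned}
```

Only $\mathbf{c}_t^{KV}$ needs to be cached; $\mathbf{k}_t$ and $\mathbf{v}_t$ (the concatenation of all heads) are rebuilt from it when attention is computed. In the published design a small extra key carrying the rotary position information is cached alongside the latent, because position encoding cannot be folded into the low-rank projection.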

KV-cache footprint: MHA vs GQA vs MLA

Same 80-layer, 64-head model, at sequence length 32,768 with 8 concurrent sequences. How much memory the keys and values take, by attention design:

MHA: 640.0 GB (keys + values for all 64 heads)
GQA: 80.0 GB (KV shared across 8 groups), 8× smaller than MHA
MLA: 22.5 GB (one low-rank latent per token), 28× smaller than MHA

The KV cache grows with sequence length, batch, and layers — at long context it can dwarf the model weights. GQA (used by Llama 3, Qwen, the Gemmas) cuts it by sharing keys/values across head groups. MLA (DeepSeek) goes further, caching one compressed latent per token. Smaller KV means longer context and bigger batches fit in memory — a training and inference win. (Illustrative 70B-class config, BF16.)

The widget compares the three attention designs we’ve now met. GQA (Llama, Qwen, Gemma) shrinks the KV cache by sharing key/value heads; MLA shrinks it much further by storing one compressed latent. A smaller KV cache means longer context and bigger batches fit in memory — which helps both training throughput and inference cost.
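The figure's numbers can be reproduced with simple arithmetic. The sketch below assumes a head dimension of 128 and, for MLA, a 512-wide latent plus a 64-wide positional key per token per layer; neither value is stated in the figure, but together they recover its totals (the "GB" shown are binary gigabytes, 2^30 bytes).

```python
def kv_cache_gib(layers, tokens, elems_per_token_per_layer, bytes_per_elem=2):
    """KV-cache size in GiB; BF16 stores 2 bytes per element."""
    return layers * tokens * elems_per_token_per_layer * bytes_per_elem / 2**30

layers, heads, head_dim = 80, 64, 128   # head_dim is an assumption
kv_groups = 8                           # GQA: 8 shared key/value heads
tokens = 32_768 * 8                     # sequence length x concurrent sequences

mha = kv_cache_gib(layers, tokens, 2 * heads * head_dim)      # K and V for every head
gqa = kv_cache_gib(layers, tokens, 2 * kv_groups * head_dim)  # K and V for 8 groups
mla = kv_cache_gib(layers, tokens, 512 + 64)                  # assumed latent + positional key

print(f"MHA {mha:.1f} GiB, GQA {gqa:.1f} GiB, MLA {mla:.1f} GiB")
print(f"GQA shrinks MHA by {mha / gqa:.0f}x, MLA by {mha / mla:.0f}x")
```

Every term is linear in sequence length and batch size, so the ratios between the three designs stay the same as context grows while the absolute gap widens.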

Multi-Token Prediction

The third change is to the objective itself. Alongside ordinary next-token prediction, DeepSeek-V3 trains with Multi-Token Prediction (MTP): at each position the model also predicts tokens further into the future (in DeepSeek-V3, one additional token beyond the next) via a small extra prediction module. This densifies the training signal (more to learn per position) and, as a bonus, the extra module can be reused to draft tokens for speculative decoding, which speeds up inference.
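As a rough sketch of how the extra predictions enter the loss: the main next-token loss is kept as is, and the average loss of the extra prediction depths is added with a small weight. The function names, the toy distributions, and the weight of 0.3 used below are illustrative assumptions, and the probabilities would come from the model rather than being passed in by hand.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under one distribution."""
    return -math.log(probs[target])

def mtp_loss(tokens, main_probs, extra_probs, weight):
    """Next-token loss plus a weighted average of extra-depth losses.

    main_probs[i]     : distribution predicted at position i for tokens[i + 1]
    extra_probs[k][i] : distribution from extra depth k (0-based) at position i
                        for tokens[i + 2 + k]
    """
    T = len(tokens)
    main_terms = [cross_entropy(main_probs[i], tokens[i + 1]) for i in range(T - 1)]
    main = sum(main_terms) / len(main_terms)

    extra_losses = []
    for k, probs_k in enumerate(extra_probs):
        offset = k + 2  # depth 0 looks two tokens ahead, depth 1 three, ...
        terms = [cross_entropy(probs_k[i], tokens[i + offset]) for i in range(T - offset)]
        extra_losses.append(sum(terms) / len(terms))
    extra = sum(extra_losses) / len(extra_losses) if extra_losses else 0.0
    return main + weight * extra

# Toy check with a 3-token vocabulary and uniform predictions everywhere.
tokens = [0, 1, 2, 0]
uniform = [[1 / 3] * 3 for _ in tokens]
loss = mtp_loss(tokens, uniform, [uniform], weight=0.3)
print(loss)  # main loss ln(3) plus 0.3 * ln(3) from the one extra depth
```

Setting `weight` to zero recovers plain next-token training, which makes it easy to see MTP as an add-on to the standard objective.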

FP8 training, validated at scale

Finally, the numerics. DeepSeek-V3 is the first model to validate FP8 mixed-precision training on an extremely large run. Recall from the precision chapter that FP8’s range is tiny; DeepSeek’s answer is fine-grained scaling (separate scaling factors for small tiles/blocks of each tensor, rather than one factor per tensor) plus keeping the most sensitive accumulations in higher precision. On hardware with FP8 tensor cores, the payoff is up to roughly double the matrix-multiply throughput of BF16 and half the memory for the tensors stored in FP8, which is a big part of how the run came in so cheap.
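To see why fine-grained scaling matters, the sketch below simulates FP8 rounding in plain Python and compares one scale for the whole tensor against one scale per 128-element tile (the tile width the paper reports for activations). The rounding function `fake_e4m3` is an approximation of the E4M3 format written for this example, not a bit-exact codec, and the data is synthetic.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def fake_e4m3(x):
    """Round x to a nearby E4M3-representable value (simulation only)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)   # saturate instead of overflowing
    _, exp = math.frexp(a)      # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)        # clamp at the smallest normal exponent
    step = 2.0 ** (e - 3)       # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)

def quantize(values, tile):
    """Scale each tile so its max maps to E4M3_MAX, round, then scale back."""
    out = []
    for start in range(0, len(values), tile):
        block = values[start:start + tile]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        out.extend(fake_e4m3(v / scale) * scale for v in block)
    return out

def mean_rel_error(ref, approx):
    return sum(abs(r - a) / abs(r) for r, a in zip(ref, approx)) / len(ref)

# Synthetic tensor: one tile of small values, one tile containing an outlier.
small = [0.001 * (1 + i % 7) for i in range(128)]
large = [1.0 + i % 5 for i in range(128)]
large[0] = 1000.0
x = small + large

err_tensor = mean_rel_error(x, quantize(x, tile=len(x)))  # one scale for everything
err_tile = mean_rel_error(x, quantize(x, tile=128))       # one scale per tile
print(f"per-tensor error {err_tensor:.3f}, per-tile error {err_tile:.3f}")
```

With a single scale, the outlier forces the small values toward the bottom of FP8's range, where several of them round to zero; with per-tile scales, the outlier only affects its own tile.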

DeepSeek-V3 is the template for efficient frontier pre-training: sparse compute (MoE), a tiny KV cache (MLA), a denser objective (MTP), and aggressive numerics (FP8). We’ll see its ideas echo through the 2026 frontier. But first, two more open models — Qwen and Gemma — each with a different emphasis.