Throughput vs latency
What knobs move what
Throughout this essay we’ve kept saying “this knob trades throughput for latency.” This is the section where we make that explicit and look at the actual knobs a vLLM operator turns to hit an SLO (Service Level Objective: a target like “p99 TTFT < 1 s”). Serving systems are tuned to maximize throughput subject to SLOs.
The roofline you’re up against
A serving system on a single GPU has two hard ceilings:
- Compute ceiling: the GPU’s peak FLOPS (~990 TFLOPS BF16 on H100). Hit at very large batches and during prefill.
- Memory-bandwidth ceiling: the GPU’s HBM bandwidth (~3.35 TB/s on H100). Hit during decode at small and moderate batch sizes, where every step has to re-read the weights no matter how few tokens it produces.
Plot throughput (tokens/sec) against batch size, and you get a classic roofline shape: throughput rises nearly linearly until you hit the compute ceiling, then plateaus. The “saturation batch size” — the batch at which the curve bends — is where you’re most efficient.
For Llama-3-8B on H100, the saturation batch is around 64–128 tokens-per-step. Below that, you’re leaving compute on the table. Above that, throughput goes flat but per-token latency rises (each step takes longer).
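The shape of that curve can be reproduced with a back-of-envelope model. The sketch below is an idealization, not a measurement: it assumes a dense model with roughly 8e9 parameters in BF16, counts about 2 FLOPs per parameter per token, and ignores KV-cache reads, attention FLOPs, and kernel efficiency. The function names are made up for illustration, and the knee of this idealized curve will generally not coincide with a measured saturation point.

```python
# Idealized roofline for one decode step of a dense model.
# Hardware numbers are the H100 figures quoted in the text.
H100_PEAK_FLOPS = 990e12         # BF16 peak, FLOP/s
H100_HBM_BYTES_PER_S = 3.35e12   # HBM bandwidth, bytes/s


def decode_step_seconds(n_params, batch_tokens, bytes_per_param=2.0,
                        peak_flops=H100_PEAK_FLOPS,
                        bandwidth=H100_HBM_BYTES_PER_S):
    """Lower bound on step time: the slower of weight reads and matmul math."""
    # Weights are streamed from HBM once per step, regardless of batch size.
    memory_s = n_params * bytes_per_param / bandwidth
    # Assumption: ~2 FLOPs (multiply + add) per parameter per token.
    compute_s = 2.0 * n_params * batch_tokens / peak_flops
    return max(memory_s, compute_s)


def tokens_per_second(n_params, batch_tokens, **kwargs):
    """Idealized throughput at a given number of tokens per step."""
    return batch_tokens / decode_step_seconds(n_params, batch_tokens, **kwargs)


if __name__ == "__main__":
    n_params = 8e9  # rough parameter count for an 8B model (assumption)
    for batch in (1, 8, 32, 64, 128, 512, 1024):
        step_ms = decode_step_seconds(n_params, batch) * 1e3
        tps = tokens_per_second(n_params, batch)
        print(f"batch={batch:5d}  step={step_ms:7.3f} ms  tokens/s={tps:10.0f}")
```

While the step is memory-bound, step time stays flat and throughput scales with batch size; once compute dominates, throughput stops growing and each step simply takes longer.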
The knobs
vLLM exposes these as flags on vllm serve:
| Knob | Direction | Trades |
|---|---|---|
| `--max-num-seqs` | up | More concurrency → higher throughput, more KV memory, more ITL |
| `--max-num-batched-tokens` | up | Bigger per-step batches → higher throughput, more ITL |
| `--enable-chunked-prefill` | on | Smooths ITL during long prompts at slight throughput cost |
| `--max-chunked-prefill-tokens` | bigger | Faster prefill, more decode-disruption |
| `--enable-prefix-caching` | on | Big TTFT and throughput wins for shared-prompt workloads |
| `--gpu-memory-utilization` | up | Bigger KV pool → more concurrent requests, less safety margin |
| `--block-size` | bigger | Less per-page overhead, more internal fragmentation |
| `--num-speculative-tokens` (K) | up | Speculation; helps if acceptance is high, hurts otherwise |
| `--speculative-model` | enabled | Speculation; net win if draft is well-matched |
| `--tensor-parallel-size` | up | More GPUs per replica → bigger models, more cross-GPU sync |
| `--pipeline-parallel-size` | up | More GPUs across pipeline → bigger models, more pipeline bubbles |
| `--quantization` | aggressive (e.g. fp8, int4) | Smaller weights + KV → more concurrent requests, some quality hit |
The combinatorics here are vicious. vLLM ships an auto-tuner that sweeps a configurable subset of these against a target workload and a target SLO, finds the Pareto-frontier configs (those that no other config beats on every metric at once), and reports them.
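To make “Pareto frontier” concrete, here is a minimal sketch of the selection step that any such sweep ends with. It is not vLLM’s tuner: the `Result` type, the function names, and the numbers in the usage example are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str          # label for one swept config (hypothetical)
    throughput: float  # tokens/s, higher is better
    ttft_p99: float    # seconds, lower is better
    itl_p99: float     # seconds, lower is better


def dominates(a, b):
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = (a.throughput >= b.throughput
                and a.ttft_p99 <= b.ttft_p99
                and a.itl_p99 <= b.itl_p99)
    better = (a.throughput > b.throughput
              or a.ttft_p99 < b.ttft_p99
              or a.itl_p99 < b.itl_p99)
    return no_worse and better


def pareto_front(results, max_ttft, max_itl):
    """Drop configs that miss the SLO, then drop configs dominated by another."""
    feasible = [r for r in results
                if r.ttft_p99 <= max_ttft and r.itl_p99 <= max_itl]
    return [r for r in feasible
            if not any(dominates(other, r) for other in feasible)]


if __name__ == "__main__":
    # Made-up sweep results, for illustration only.
    sweep = [
        Result("a", 9000, 0.40, 0.020),
        Result("b", 14000, 0.70, 0.035),
        Result("c", 13000, 0.80, 0.040),  # worse than "b" on every metric
        Result("d", 19000, 1.60, 0.045),  # misses the TTFT target
        Result("e", 16000, 0.90, 0.060),  # misses the ITL target
    ]
    for r in pareto_front(sweep, max_ttft=1.0, max_itl=0.050):
        print(r)
```

Filtering on the SLO first matters: a config with the best raw throughput is useless if it misses a latency target, so it should never be allowed to knock a feasible config off the frontier.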
The three numbers you actually report
Any serving SLO comes down to three numbers:
- TTFT p99: the time-to-first-token that 99% of requests come in under; only the slowest 1% wait longer. This is what users feel when they press Enter.
- ITL p99: the same percentile for inter-token latency. Choppy streaming is way worse than slow-but-steady streaming.
- Throughput: total tokens generated per second across all in-flight requests. Divide GPU cost per second by this number to get $ / token.
A common target: “p99 TTFT < 1 s, p99 ITL < 50 ms, maximize throughput subject to those.”
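As a rough illustration of how those three numbers fall out of a request trace, the sketch below uses a nearest-rank percentile and a hypothetical trace format (an arrival time plus the emit time of each token, in seconds). It pools ITL samples across all requests; some benchmarks average per request first, so treat this as one reasonable convention rather than the definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with >= pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests):
    """requests: list of (arrival_time, [token_emit_time, ...]); each has >= 1 token."""
    ttfts, itls, n_tokens = [], [], 0
    for arrival, emit_times in requests:
        ttfts.append(emit_times[0] - arrival)                            # time to first token
        itls.extend(b - a for a, b in zip(emit_times, emit_times[1:]))   # gaps between tokens
        n_tokens += len(emit_times)
    start = min(arrival for arrival, _ in requests)
    end = max(emit_times[-1] for _, emit_times in requests)
    return {
        "ttft_p99": percentile(ttfts, 99),
        "itl_p99": percentile(itls, 99),
        "throughput": n_tokens / (end - start),  # tokens/s over the whole trace window
    }


def meets_slo(summary, max_ttft=1.0, max_itl=0.050):
    """Check the example target from the text: p99 TTFT < 1 s, p99 ITL < 50 ms."""
    return summary["ttft_p99"] < max_ttft and summary["itl_p99"] < max_itl
```

With only a handful of samples, p99 collapses to the maximum, which is why percentile targets only mean something over a trace with at least a few hundred requests.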
Two stylized workloads, two configs
To make the tradeoffs concrete, contrast two services running the same Llama-3-70B model.
A: Code completion — short prompts (~200 tokens), short completions (~50 tokens), high request rate, very tight latency.
- Small batch sizes (tight latency).
- Aggressive prefix caching (header files, project context repeat).
- Chunked prefill on (don’t let any one prompt dominate).
- Speculation off (small K wouldn’t recover the draft cost at low batch sizes).
- KV cache moderate, lots of free pages so admissions are fast.
B: Document analysis — long prompts (~16k tokens), long completions (~2k tokens), low request rate, latency-tolerant.
- Big batch sizes (KV cache is the bottleneck per-request anyway).
- Chunked prefill on (16k prompts must be split).
- Speculation on with EAGLE draft (per-request throughput is the metric).
- KV cache mostly full all the time; preempt-and-swap on.
- Maybe TP=2 to fit longer contexts in HBM.
Same model. Same engine. Different configs because they have different shapes.
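One way to keep two such configs side by side is to treat them as data and generate the `vllm serve` command line from them. The sketch below is illustrative only: it uses flags from the table above, the model id and every numeric value are placeholder assumptions rather than tuned settings, flag names can change between vLLM versions, and it leaves out speculation as well as whatever parallelism or quantization profile A needs to fit the weights.

```python
import shlex

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"  # assumed model id

# Placeholder values chosen only to show the direction of each knob.
PROFILES = {
    "code_completion": {
        "max-num-seqs": 32,              # small batches for tight latency
        "max-num-batched-tokens": 2048,
        "enable-chunked-prefill": True,
        "enable-prefix-caching": True,   # shared headers and project context
        "gpu-memory-utilization": 0.85,  # keep free pages so admissions are fast
    },
    "document_analysis": {
        "max-num-seqs": 256,             # big batches, latency-tolerant
        "max-num-batched-tokens": 8192,
        "enable-chunked-prefill": True,  # 16k prompts must be split
        "gpu-memory-utilization": 0.95,  # KV cache mostly full
        "tensor-parallel-size": 2,
    },
}


def build_command(profile):
    """Turn one profile into an argv list for `vllm serve`."""
    argv = ["vllm", "serve", MODEL]
    for flag, value in PROFILES[profile].items():
        if value is True:
            argv.append(f"--{flag}")              # boolean switch
        elif value is not False:
            argv.extend([f"--{flag}", str(value)])
    return argv


if __name__ == "__main__":
    for name in PROFILES:
        print(shlex.join(build_command(name)))
```

Keeping profiles as plain dictionaries makes it easy to diff two deployments and see at a glance which knobs differ.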
Quantization, briefly
The other lever we haven’t discussed is quantization: storing weights (and optionally KV) in fewer bits per number.
- FP16 / BF16: 2 bytes/param. The default. Llama-3-70B = 140 GB.
- FP8: 1 byte/param. Almost no quality loss on most workloads. 70B = 70 GB, which fits on a single 80 GB H100, though it leaves little HBM for the KV cache.
- INT4 (AWQ, GPTQ, GGUF Q4_K_M): 0.5 byte/param. Some quality cost, big throughput win because the memory traffic halves again.
- KV cache quantization: FP8 or even INT4 KV. The cache fits 2-4× more concurrent requests for the same HBM.
Lower-precision formats are the single most cost-effective serving optimization left for most teams; modern H100/B200 chips support FP8 natively in the tensor cores, so FP8 matmuls need no software emulation and run at least as fast as BF16.
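The memory arithmetic behind these bullets is simple enough to script. The sketch below assumes Llama-3-70B’s published shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128), uses decimal gigabytes, and ignores activations and runtime overhead, so the capacities it prints are rough upper bounds. The helper names and the 0.9 utilization default are illustrative choices.

```python
GB = 1e9  # decimal gigabytes, matching the "70B = 140 GB" arithmetic in the text


def weight_gb(n_params, bytes_per_param):
    """Weight footprint in GB for a given storage precision."""
    return n_params * bytes_per_param / GB


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """One K vector and one V vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_token_capacity(hbm_gb, weights_gb, per_token_bytes, utilization=0.9):
    """Tokens of KV cache that fit after weights, ignoring activations (assumption)."""
    budget = hbm_gb * GB * utilization - weights_gb * GB
    return max(0, int(budget // per_token_bytes))


if __name__ == "__main__":
    n_params = 70e9
    for label, nbytes in (("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)):
        print(f"{label:5s} weights: {weight_gb(n_params, nbytes):6.1f} GB")

    # Assumed Llama-3-70B shape: 80 layers, 8 KV heads, head_dim 128.
    kv_fp16 = kv_bytes_per_token(80, 8, 128, 2)
    kv_fp8 = kv_bytes_per_token(80, 8, 128, 1)
    print("KV bytes/token:", kv_fp16, "(FP16)", kv_fp8, "(FP8)")

    # FP8 weights on one 80 GB GPU: how many tokens of KV cache fit?
    print("FP16 KV tokens:", kv_token_capacity(80, 70, kv_fp16))
    print("FP8  KV tokens:", kv_token_capacity(80, 70, kv_fp8))
```

Halving the bytes per KV element doubles the number of cached tokens for the same budget, which is where the extra concurrent requests come from.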
What we did not cover
This essay’s playbook is the mainline. A few important topics we skipped to keep the length manageable:
- FlashAttention: the fused attention kernel that does QKᵀ and softmax in a single pass, never materializing the full attention matrix. Used everywhere, including under the hood of paged attention.
- CUDA graphs — a way to record a sequence of kernel launches and replay them with low per-launch overhead. vLLM uses these in decode for the most common batch shapes.
- Mixture-of-Experts (MoE) routing — DeepSeek, Mixtral, etc. activate only a subset of experts per token. The serving story for MoE has its own set of tricks.
- Disaggregated P/D at scale — already mentioned in §17; the production engineering of a separated prefill cluster + decode cluster is its own subfield.
The recap and pointers to further reading are next.