Latency vs Throughput
The two metrics every inference optimization trades between.
Almost every decision in an inference engine is a trade between two numbers. You can't reason about a change until you can say which one goes up and which goes down.
Definitions
Latency — time for a single request, in milliseconds. Felt by one user.
Throughput — work completed per unit time, in tokens/sec or req/sec. Felt by your GPU bill.
These are not the same metric. Many techniques improve one at the expense of the other, so a change has to be judged against both.
Why both matter
- Lower latency → feels snappy.
- Higher throughput → cheaper per token at scale. A GPU serving 100 concurrent users instead of 10 cuts cost per user by about 10×, since the GPU costs the same per hour either way.
A latency-only server with no batching can be 5–20× more expensive per token, because each forward pass serves a single request. A throughput-only server with aggressive batching makes users wait in queues. Real systems pick a point between the two.
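The cost argument above can be sketched as a short calculation. The GPU price and throughput figures below are hypothetical placeholders, not measurements, and the example assumes batched throughput scales linearly with the number of users, which real systems only approximate.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            system_tokens_per_sec: float) -> float:
    """Dollars spent to generate one million tokens at a given system throughput."""
    tokens_per_hour = system_tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Hypothetical numbers, chosen only to illustrate the ratio.
GPU_PRICE = 2.0            # dollars per GPU-hour (assumed)
UNBATCHED_TPS = 50.0       # one user, no batching (assumed)
BATCHED_TPS = 500.0        # ten users batched, ideal linear scaling (assumed)

unbatched_cost = cost_per_million_tokens(GPU_PRICE, UNBATCHED_TPS)
batched_cost = cost_per_million_tokens(GPU_PRICE, BATCHED_TPS)

print(f"unbatched: ${unbatched_cost:.2f} per 1M tokens")
print(f"batched:   ${batched_cost:.2f} per 1M tokens")
print(f"ratio:     {unbatched_cost / batched_cost:.1f}x")
```

The GPU bill is fixed per hour, so cost per token is inversely proportional to system throughput. Any technique that raises throughput without adding hardware lowers the price of every token.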
Levers
Reduces latency: disaggregating prefill/decode, multi-GPU tensor parallelism, quantization (smaller weights take less time to read from memory).
Increases throughput: batching (combine users into one forward pass, at the cost of a small delay while requests are collected), better KV cache layout (fit more users on the same GPU).
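To make the batching delay concrete, here is a minimal sketch of a timeout-based batcher. It is a toy model: `form_batches`, `max_batch`, and `max_wait_ms` are names invented for this illustration, and production engines often use continuous batching rather than this simple scheme.

```python
def form_batches(arrivals_ms, max_batch, max_wait_ms):
    """Group sorted arrival times (ms) into batches.

    A batch opens at its first arrival and is dispatched either when it is
    full or when max_wait_ms has elapsed since it opened.
    Returns a list of (dispatch_time_ms, [arrival_times]) tuples.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        # The open batch timed out before this request arrived: dispatch it.
        if current and t - current[0] > max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        # A full batch is dispatched immediately.
        if len(current) == max_batch:
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + max_wait_ms, current))
    return batches


def queue_delays(batches):
    """Time each request spent waiting for its batch to be dispatched."""
    return [dispatch - arrival
            for dispatch, members in batches
            for arrival in members]


arrivals = [0, 2, 3, 20, 21, 50]  # hypothetical arrival times in ms
batches = form_batches(arrivals, max_batch=3, max_wait_ms=10)
print(batches)
print(queue_delays(batches))
```

Raising `max_wait_ms` gives batches more time to fill, which helps throughput, but every request in a partly filled batch pays that wait as added latency.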
Memory is what usually caps throughput. If you can't fit another request's KV cache, you can't batch it.
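A rough sizing sketch shows why memory is the cap. For standard attention, each token stores one key and one value vector per layer per KV head. The model dimensions and free-memory figure below are hypothetical, and the estimate ignores fragmentation and activation memory.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(free_bytes: int, tokens_per_request: int,
                            bytes_per_token: int) -> int:
    """How many requests fit if each reserves a full context of KV cache."""
    return free_bytes // (tokens_per_request * bytes_per_token)


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads of dimension 128, fp16 cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# Hypothetical budget: 40 GiB left after weights, 4096-token contexts.
fit_fp16 = max_concurrent_requests(40 * GIB, 4096, per_token)

# Same model with a 1-byte-per-element cache (assumed 8-bit KV quantization).
per_token_int8 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)
fit_int8 = max_concurrent_requests(40 * GIB, 4096, per_token_int8)

print(per_token, fit_fp16, fit_int8)  # 131072 80 160
```

Halving the bytes per cached element doubles the number of requests that fit, which is why cache layout and cache precision act as throughput levers.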
Metrics
TTFT — Time To First Token. Dominated by prefill (running the prompt through the model). What users feel as "the model is thinking."
TPS — Tokens Per Second. Two flavors:
- Per-user TPS — tokens streamed to one user. Drops as you batch more users together.
- System TPS — tokens across all in-flight users. Goes up as you batch more.
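The opposite movement of the two flavors can be illustrated with a toy cost model. It assumes each decode step takes a fixed overhead plus a small cost per sequence in the batch, and emits one token per sequence. The linear form and the constants are assumptions for illustration; real step-time curves depend on the model and hardware.

```python
def decode_rates(batch_size: int, fixed_ms: float = 20.0,
                 per_seq_ms: float = 0.5) -> tuple[float, float]:
    """Return (per_user_tps, system_tps) under a linear step-time model.

    fixed_ms and per_seq_ms are hypothetical constants, not measurements.
    """
    step_ms = fixed_ms + per_seq_ms * batch_size
    per_user_tps = 1000.0 / step_ms          # one token per user per step
    system_tps = per_user_tps * batch_size   # summed over the whole batch
    return per_user_tps, system_tps


for batch in (1, 8, 32, 128):
    per_user, system = decode_rates(batch)
    print(f"batch={batch:4d}  per-user={per_user:6.1f} tok/s  "
          f"system={system:8.1f} tok/s")
```

Under this model, per-user TPS falls slowly while system TPS climbs almost linearly at small batch sizes, because the fixed overhead is shared by more users.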
A healthy report includes both. Optimizing only one is how systems end up painfully slow or quietly expensive.
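As one possible way to produce such a report, the sketch below derives TTFT, per-user TPS, and system TPS from per-request token timestamps. `RequestTrace` and its fields are hypothetical names for this example, and definitions vary between tools (for instance, whether per-user TPS includes the first token).

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    sent_at: float             # seconds, when the request was submitted
    token_times: list[float]   # seconds, arrival time of each streamed token


def ttft(trace: RequestTrace) -> float:
    """Time from submission to the first streamed token."""
    return trace.token_times[0] - trace.sent_at


def per_user_tps(trace: RequestTrace) -> float:
    """Streaming rate after the first token, as seen by one user."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (n - 1) / (trace.token_times[-1] - trace.token_times[0])


def system_tps(traces: list[RequestTrace]) -> float:
    """Total tokens across all requests divided by wall-clock time."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


# Two hypothetical requests served concurrently.
traces = [
    RequestTrace(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]),
    RequestTrace(0.0, [1.0, 1.25, 1.5, 1.75, 2.0]),
]
for t in traces:
    print(f"TTFT={ttft(t):.2f}s  per-user TPS={per_user_tps(t):.1f}")
print(f"system TPS={system_tps(traces):.1f}")
```

Both users stream at the same rate here, yet the second one waits twice as long for the first token, a difference that a system-level number alone would hide.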