System Pillars

Every serious inference engine is a stack of decisions across the same seven concerns. They aren't optional — every system has an answer for each, even if the answer is "the default."

1. Kernel implementations

The actual GPU code: matmuls, softmax, layernorm. CUDA + cuBLAS, increasingly Triton for fused kernels. A poorly-tuned kernel leaves 80% of arithmetic throughput unused. A well-tuned one (FlashAttention) delivers order-of-magnitude wins without touching the model.

2. KV cache management

Unique to LLMs. During decoding, every previous token's K, V must be kept so the next token's attention can read them. The cache grows linearly with sequence length, is the dominant memory cost of LLM serving (often bigger than weights), and determines how many concurrent users fit on one GPU. See KV cache.

3. Memory management

Two consumers compete for VRAM: weights (fixed) and KV cache (variable per user). Quantization is the main tool — INT8/INT4 weights, FP8 cache. Anything that shrinks either consumer frees budget for more users.

4. Execution strategy

How a request moves through the model:

Prefill — process the whole prompt in one parallel pass. Compute-bound, batches well.
Decode — generate one token at a time. Memory-bound, dominates long-generation cost.

Modern engines often disaggregate these phases — different resource profiles, mixing them wastes both.

5. Scheduling and batching

The runtime loop that decides which requests join the next forward pass.

Dynamic batching — wait a few ms, gather what's queued, fire one batch.
Continuous batching — merge new requests into an already-running batch at each step.

Trade-off: longer batching window → higher throughput, worse TTFT.

6. Quantization

Fewer bits than training precision. FP32 → FP16/BF16 is almost free (2× memory savings). INT8/INT4 needs real accuracy work, and kernel support is the bottleneck — a quantization format only matters if the kernels exist.

Triangle: accuracy ↔ memory ↔ kernel support. Optimize for two at a time, max.

7. Hardware

Single GPU is the cheapest happy path; most models under ~30B fit with quantization. Multi-GPU options: tensor parallelism (shards each layer, good for latency on big models) and pipeline parallelism (shards layers, good for throughput).

The hardware layer determines which strategies above are even available.