Layers of Inference Engineering

"Inference engine" can mean four different things. Splitting them apart is the fastest way to locate a problem.

Runtime

Executes the model on a single machine. vLLM, SGLang, and TensorRT-LLM are frameworks on PyTorch/CUDA that handle batching, KV cache management, quantization, speculative decoding, tensor parallelism, and prefill/decode disaggregation.

If tokens-per-second per GPU is wrong, look here first.
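To make the metric concrete, here is a minimal sketch of how tokens-per-second per GPU could be computed from a benchmark run. The function name and its parameters are illustrative assumptions, not part of any runtime's API. It counts generated tokens only, which is one common convention.

```python
def tokens_per_second_per_gpu(output_tokens: int, wall_seconds: float, num_gpus: int) -> float:
    """Aggregate decode throughput normalized by GPU count.

    Hypothetical helper: output_tokens is the total number of tokens
    generated across all requests during the measurement window.
    """
    if wall_seconds <= 0 or num_gpus <= 0:
        raise ValueError("wall_seconds and num_gpus must be positive")
    return output_tokens / wall_seconds / num_gpus


if __name__ == "__main__":
    # 12,000 tokens generated in 10 seconds on a 4-GPU replica
    print(tokens_per_second_per_gpu(12_000, 10.0, 4))
```

Comparing this number before and after a runtime change (batch size, quantization, speculative decoding) is one way to tell whether the change helped.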

Infrastructure

Turns one running model into a fleet: Kubernetes, autoscalers, routers, load balancers. The work is spinning up replicas fast enough to absorb spikes, routing each request to a replica that already has the right model loaded, and controlling cost.

The runtime makes a single GPU fast. Infrastructure makes a thousand GPUs useful.
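Routing to a replica that already has the right model loaded can be sketched as below. This is a toy policy under stated assumptions: the `Replica` class, its fields, and `pick_replica` are hypothetical names for illustration, and real routers usually weigh more signals (queue depth, cache state, cost).

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # Hypothetical view of a replica as a router might track it.
    name: str
    loaded_models: set = field(default_factory=set)
    in_flight: int = 0  # requests currently being served


def pick_replica(replicas, model):
    """Prefer replicas with the model already loaded; break ties by load."""
    if not replicas:
        raise ValueError("no replicas available")
    warm = [r for r in replicas if model in r.loaded_models]
    # Fall back to any replica (accepting a model load) if none is warm.
    candidates = warm or replicas
    return min(candidates, key=lambda r: r.in_flight)
```

In this sketch a busy warm replica still wins over an idle cold one, on the assumption that loading weights costs more than waiting in a short queue. That trade-off is a design choice, not a rule.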

Tools

Everything around the model: prompt management, context compaction, agent harnesses, eval pipelines. The model is rarely the whole product — it's a component in a larger loop.

Orchestration

When a product depends on multiple models or pipelines — router + retrieval + vision + reasoning — orchestration ties them together. Goal: reduce end-to-end latency across the chain and keep deployment complexity manageable.
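One way orchestration can cut end-to-end latency is by running stages that do not depend on each other concurrently. The sketch below simulates this with `asyncio`. The stage names mirror the example chain above, while `fake_stage` and the sleep durations are stand-ins for real model calls, not measurements.

```python
import asyncio
import time


async def fake_stage(name: str, seconds: float) -> str:
    # Stand-in for a model or pipeline call; sleeps instead of doing work.
    await asyncio.sleep(seconds)
    return name


async def run_chain():
    start = time.perf_counter()
    route = await fake_stage("router", 0.01)
    # Retrieval and vision are assumed independent, so they overlap.
    retrieved, seen = await asyncio.gather(
        fake_stage("retrieval", 0.05),
        fake_stage("vision", 0.05),
    )
    # Reasoning needs both results, so it runs last.
    answer = await fake_stage("reasoning", 0.01)
    elapsed = time.perf_counter() - start
    return [route, retrieved, seen, answer], elapsed


if __name__ == "__main__":
    stages, elapsed = asyncio.run(run_chain())
    print(stages, f"{elapsed:.3f}s")
```

Run strictly in sequence, the simulated stages would sum to about 0.12 s. Overlapping the two middle stages brings the chain closer to 0.07 s.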

Why split it this way?

Because the root cause of "slow" or "expensive" usually lives in one layer, and the symptom tells you which. Single-request latency is high on a warm GPU → runtime. Cold starts take minutes → infrastructure. Agent loop is slow but outputs are right → tools/orchestration.
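The symptom-to-layer mapping can be written down as a small triage helper. This is only an illustrative sketch: the `Layer` enum, the `triage` function, and its boolean flags are made-up names, and real diagnosis needs actual measurements rather than three booleans.

```python
from enum import Enum


class Layer(Enum):
    RUNTIME = "runtime"
    INFRASTRUCTURE = "infrastructure"
    TOOLS_OR_ORCHESTRATION = "tools/orchestration"


def triage(warm_request_slow: bool, cold_start_slow: bool, agent_loop_slow: bool):
    """Return the layers to inspect first, given observed symptoms."""
    suspects = []
    if warm_request_slow:
        # A single request is slow even on a warm GPU.
        suspects.append(Layer.RUNTIME)
    if cold_start_slow:
        # Replicas take minutes to become ready.
        suspects.append(Layer.INFRASTRUCTURE)
    if agent_loop_slow:
        # Outputs are right, but the surrounding loop is slow.
        suspects.append(Layer.TOOLS_OR_ORCHESTRATION)
    return suspects
```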

The rest of the Fundamentals pages live in the runtime layer, except Serverless GPUs, which is infrastructure.