Layers of Inference Engineering
An inference stack is four layers, each with its own concerns.
"Inference engine" can mean four different things. Splitting them apart is the fastest way to locate a problem.
Runtime
Executes the model on a single machine. vLLM, SGLang, TensorRT-LLM — frameworks on PyTorch/CUDA that handle batching, KV cache, quantization, speculative decoding, tensor parallelism, prefill/decode disaggregation.
If tokens-per-second per GPU is wrong, look here first.
Infrastructure
Turns one running model into a fleet. Kubernetes, autoscalers, routers, load balancers. Spinning up replicas fast enough to absorb spikes, routing to a replica that already has the right model loaded, controlling cost.
The runtime makes a single GPU fast. Infrastructure makes a thousand GPUs useful.
Tools
Everything around the model: prompt management, context compaction, agent harnesses, eval pipelines. The model is rarely the whole product — it's a component in a larger loop.
Orchestration
When a product depends on multiple models or pipelines — router + retrieval + vision + reasoning — orchestration ties them together. Goal: reduce end-to-end latency across the chain and keep deployment complexity manageable.
Why split it this way?
Because the root cause of "slow" or "expensive" almost always lives at exactly one layer. Single-request latency on a warm GPU → runtime. Cold starts take minutes → infrastructure. Agent loop is slow but outputs are right → tools/orchestration.
The rest of the Fundamentals pages live in the runtime layer, except Serverless GPUs, which is infrastructure.