Fundamentals
What is an Inference Engine?
The system that takes a trained model and runs it efficiently on real hardware.
An inference engine runs a trained model efficiently on the target hardware: CPUs, GPUs, TPUs, or NPUs. Training produces weights and a computation graph; inference executes that graph billions of times in production.
It maps the model's ops onto optimized kernels, manages memory carefully, and applies hardware-specific optimizations, aiming for low latency and high throughput.
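To make the latency and throughput tension concrete, here is a minimal sketch of a toy cost model for batching. It assumes each forward pass costs a fixed overhead plus a small amount per request; the function names and the numbers (`fixed_ms`, `per_request_ms`) are hypothetical and chosen only for illustration, not measured from any real engine.

```python
def batch_step_ms(batch_size, fixed_ms=8.0, per_request_ms=0.5):
    """Time for one forward pass over a batch, under a toy linear cost model.

    fixed_ms and per_request_ms are made-up illustrative values.
    """
    return fixed_ms + per_request_ms * batch_size


def throughput_rps(batch_size, fixed_ms=8.0, per_request_ms=0.5):
    """Requests completed per second when batches run back to back."""
    step_seconds = batch_step_ms(batch_size, fixed_ms, per_request_ms) / 1000.0
    return batch_size / step_seconds


if __name__ == "__main__":
    for b in (1, 8, 32):
        print(f"batch={b:>2}  latency={batch_step_ms(b):5.1f} ms  "
              f"throughput={throughput_rps(b):7.1f} req/s")
```

Under this model, larger batches spread the fixed overhead over more requests, so throughput rises, but every request in the batch waits for the whole step, so latency rises too. Real engines behave less linearly, but the direction of the trade-off is the same.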
A naive model.forward(x) leaves a lot of performance unused: every op launches its own kernel, the KV cache fragments memory, requests aren't batched, and weights stay in full precision. The engine closes that gap.
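The first of those gaps, one kernel launch per op, can be illustrated with a small pure-Python sketch. It is only an analogy: each list pass stands in for a kernel launch, and the function names (`unfused`, `fused`) are hypothetical, not part of any engine's API. The point is that fusing scale, bias, and ReLU into one pass gives the same result with fewer launches and no intermediate buffers.

```python
def unfused(xs, scale, bias):
    """Three separate passes; each one stands in for its own kernel launch."""
    launches = 0
    scaled = [x * scale for x in xs]            # pass 1: writes an intermediate
    launches += 1
    shifted = [x + bias for x in scaled]        # pass 2: writes another intermediate
    launches += 1
    activated = [max(0.0, x) for x in shifted]  # pass 3: ReLU
    launches += 1
    return activated, launches


def fused(xs, scale, bias):
    """One pass that applies scale, bias, and ReLU per element."""
    return [max(0.0, x * scale + bias) for x in xs], 1


if __name__ == "__main__":
    data = [-2.0, -0.5, 0.0, 1.5]
    print(unfused(data, 2.0, 1.0))
    print(fused(data, 2.0, 1.0))
```

On real hardware the saving comes from fewer launches and fewer round trips to memory for the intermediates; this sketch only counts passes and does not measure either.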
What this section covers
Layers: Runtime, infrastructure, tools, orchestration.
Latency vs Throughput: The two metrics every optimization trades between.
System pillars: Kernels, KV cache, memory, scheduling, quantization, hardware.
KV cache: Why decoding LLMs is memory-bound.
FlashAttention: Tiling attention via online softmax.
Serverless GPUs: Cold starts from 30 minutes to 50 seconds.
If you just want to call the API, start with "Getting started"; none of this is required.