Serverless GPUs & Cold Starts
How a 33-minute cold start gets reduced to ~50 seconds.
Inference traffic is spiky. You don't control when users send requests. Two ways to handle it: over-provision (wasteful) or auto-scale fast enough (hard — this is the whole problem). The hard path is the only one that scales economically.
To take it, cold starts — time from "spin up replica" to "ready to serve" — must be small enough to react to spikes in seconds.
This page walks through how a real platform (Modal's public writeup) takes a naive ~2000 s cold start down to ~50 s by stacking four layers, each compounding on the last.
The metric
GPU Allocation Utilization = useful GPU-seconds / paid GPU-seconds
Most orgs run 10–20%. The fix isn't buying fewer GPUs; it's spinning new ones up fast enough that you don't need a giant idle buffer as insurance.
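As a quick illustration of the metric, the sketch below computes allocation utilization for a made-up fleet. The fleet size and busy time are hypothetical numbers chosen to land inside the 10–20% range quoted above; they are not taken from the source.

```python
def allocation_utilization(useful_gpu_seconds: float, paid_gpu_seconds: float) -> float:
    """Fraction of paid GPU time that did useful work."""
    if paid_gpu_seconds <= 0:
        raise ValueError("paid_gpu_seconds must be positive")
    return useful_gpu_seconds / paid_gpu_seconds


# Hypothetical fleet: 8 GPUs paid for one full hour,
# serving requests for 72 GPU-minutes in total.
paid = 8 * 3600    # 28,800 paid GPU-seconds
useful = 72 * 60   # 4,320 useful GPU-seconds
utilization = allocation_utilization(useful, paid)
print(f"{utilization:.0%}")  # prints "15%"
```

Everything between 15% and 100% here is time spent paying for GPUs that sit idle as insurance against a spike.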
The four layers
1. Cloud buffer — removes tens of minutes
A pool of idle, health-checked GPUs is kept ready at all times. Booting a VM and running health checks takes 10–30 min. Doing that ahead of time and keeping the buffer warm moves this cost off the hot path. The buffer means some GPUs are always idle, which is a deliberate cost: a fleet at 100% utilization has no slack and is fragile.
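To get a feel for why the buffer is a real cost, here is a toy sizing model. It is an assumption for illustration, not the platform's actual policy: GPUs are taken from the buffer at a steady rate, and each one taken is replaced by a fresh VM that needs the full provisioning time before it is usable. The function name and the demand rate are hypothetical.

```python
import math


def buffer_size(gpus_per_minute: float, provision_minutes: float) -> int:
    """Warm GPUs needed so no request waits on a fresh VM boot.

    Toy model: steady demand, and every GPU handed out is replaced by
    one that takes `provision_minutes` to boot and pass health checks.
    """
    return math.ceil(gpus_per_minute * provision_minutes)


# Hypothetical demand of 2 GPUs per minute, at both ends of the
# 10-30 minute provisioning range.
best_case = buffer_size(2, 10)
worst_case = buffer_size(2, 30)
print(best_case, worst_case)  # prints "20 60"
```

Under these assumptions the buffer scales linearly with provisioning time, so the idle pool is the price paid for hiding a slow VM boot.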
2. Custom filesystem (ImageFS) — saves ~1 minute
Container images are huge (PyTorch + CUDA + your code = GBs), so pulling one in full before starting is slow. A lazy-loading, content-addressed, tiered filesystem avoids that:
- Lazy load: only the index loads upfront. Files fetched on demand or never.
- Content addressing: files stored by hash. Every container's torch/ is the same bytes, cached once.
- Tiered cache: host RAM → local SSD → object storage.
After the first container on a host, subsequent ones skip the network for shared deps.
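The three properties above can be sketched in a few lines. This is a toy model under stated assumptions, not ImageFS itself: the class names `TieredStore` and `LazyImage` are hypothetical, plain dicts stand in for RAM, SSD, and object storage, and SHA-256 is an assumed hash choice.

```python
import hashlib


def digest_of(data: bytes) -> str:
    """Content address: the SHA-256 of the file bytes."""
    return hashlib.sha256(data).hexdigest()


class TieredStore:
    """Per-host cache: RAM, then SSD, then object storage over the network."""

    def __init__(self, object_storage: dict):
        self.ram = {}
        self.ssd = {}
        self.object_storage = object_storage
        self.network_fetches = 0

    def read(self, digest: str) -> bytes:
        if digest in self.ram:
            return self.ram[digest]
        if digest in self.ssd:
            data = self.ssd[digest]
        else:
            data = self.object_storage[digest]  # slowest tier
            self.network_fetches += 1
            self.ssd[digest] = data
        self.ram[digest] = data
        return data


class LazyImage:
    """Only the index (path -> digest) exists upfront; bytes load on first read."""

    def __init__(self, index: dict, store: TieredStore):
        self.index = index
        self.store = store

    def open(self, path: str) -> bytes:
        return self.store.read(self.index[path])


torch_bytes, docs = b"pretend this is torch/", b"never-read docs"
code_a, code_b = b"app A code", b"app B code"
remote = {digest_of(blob): blob for blob in (torch_bytes, docs, code_a, code_b)}
host = TieredStore(remote)

image_a = LazyImage({"torch/": digest_of(torch_bytes), "main.py": digest_of(code_a),
                     "docs/": digest_of(docs)}, host)
image_b = LazyImage({"torch/": digest_of(torch_bytes), "main.py": digest_of(code_b)}, host)

image_a.open("torch/")
image_a.open("main.py")
fetches_after_a = host.network_fetches  # torch/ and app A code

image_b.open("torch/")                  # same digest, served from host RAM
image_b.open("main.py")
fetches_after_b = host.network_fetches  # only app B code was new
```

The second container costs one network fetch instead of two, and the docs that nothing reads are never fetched at all.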
3. CPU memory snapshots — ~10× speedup on host startup
import torch triggers thousands of lines of Python and tens of thousands of OS calls. Do that work once, freeze process memory to disk, restore directly into the post-init state on every cold start.
Split initialization into work that is captured in the snapshot and work that must re-run on every cold start (network connections, anything stateful):
    import modal

    app = modal.App("inference-service")  # app name is illustrative

    @app.cls(enable_memory_snapshot=True)
    class InferenceService:
        @modal.enter(snap=True)   # runs once; the snapshot is taken after this
        def startup(self):
            import torch
            self.model = load_model()  # placeholder for your model loader

        @modal.enter(snap=False)  # re-runs on every cold start, after restore
        def finalize(self):
            self.connect_to_db()  # placeholder: reopen network connections

This is easier when the container runtime is gVisor: the entire kernel state lives in one Go program, so serializing it is straightforward. Snapshots are CPU-architecture specific, so a heterogeneous fleet needs one per ISA.
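The split between snapshotted and re-run work can be mimicked in plain Python. This is only an analogy: a real memory snapshot freezes the whole process, whereas this sketch pickles a single state dict, and all names here (`startup`, `finalize`, `cold_start`, `counters`) are hypothetical.

```python
import pickle

counters = {"startup_runs": 0}


def startup() -> dict:
    """Expensive, deterministic init: safe to capture in a snapshot."""
    counters["startup_runs"] += 1
    return {"squares": [n * n for n in range(100_000)]}


def finalize(state: dict) -> dict:
    """Stateful setup that must be redone after every restore."""
    state["connection"] = object()  # stand-in for a live socket
    return state


# Do the expensive work once and freeze the post-init state.
snapshot = pickle.dumps(startup())


def cold_start(frozen: bytes) -> dict:
    """Restore the post-init state, then re-run only the stateful step."""
    return finalize(pickle.loads(frozen))


replica_a = cold_start(snapshot)
replica_b = cold_start(snapshot)
```

Both replicas come up with the initialized table without re-running `startup`, while each gets its own fresh connection.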
4. GPU memory snapshots — 4–10× speedup on device startup
The remaining cost is inference engine setup: CUDA graph capture, Torch compilation, KV cache allocation. Tens of seconds to minutes per fresh GPU.
CUDA graphs are recorded GPU op sequences replayed as one unit. Capturing requires a model dry-run; the result lives in VRAM as raw pointers — no native disk serialization.
The trick: don't serialize objects, save the entire GPU memory space byte-for-byte.
Snapshot: driver copies VRAM → CPU RAM → ImageFS
Restore: load → CPU RAM → driver pushes back to VRAM
Everything lands at the same address, so all internal pointers stay valid automatically. Weights are huge but cheap to reload, so offload them before snapshotting, snapshot only the graphs, and reload the weights after. Smaller snapshot, same wins.
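Why same-address restore keeps pointers valid can be seen in a toy model. This is not the CUDA driver API: a `bytearray` stands in for VRAM, and a 4-byte integer stored at address 0 plays the role of a raw pointer. The layout and function names are assumptions made for the example.

```python
import struct


def build_device_memory() -> bytearray:
    """Toy 'VRAM': a node at address 0 holds a raw pointer to a payload."""
    mem = bytearray(64)
    payload_addr = 32
    mem[payload_addr:payload_addr + 5] = b"graph"
    struct.pack_into("<I", mem, 0, payload_addr)  # absolute address
    return mem


def follow_pointer(mem: bytearray, node_addr: int) -> bytes:
    """Read the pointer stored at node_addr and return the 5 bytes it targets."""
    (target,) = struct.unpack_from("<I", mem, node_addr)
    return bytes(mem[target:target + 5])


vram = build_device_memory()
snapshot = bytes(vram)  # byte-for-byte copy, no object serialization

# Restore at the same base address: the stored pointer still hits the payload.
restored = bytearray(snapshot)
same_address = follow_pointer(restored, 0)

# Restore 16 bytes higher: the node moved, but its pointer still says 32.
shifted = bytearray(16) + bytearray(snapshot)
moved = follow_pointer(shifted, 16)
```

`same_address` is `b"graph"`, while `moved` reads stale bytes, which is why restoring everything to its original address avoids any need to understand or rewrite the pointers.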
Today this works cleanly only for single-GPU setups. Multi-GPU NCCL setups deadlock when paused for a checkpoint.
How the layers stack
Baseline: ~2000 s
- Cloud Buffer: off the hot path (saves tens of minutes)
- ImageFS: lazy + content-addressed (saves ~1 minute)
- CPU Snapshots: skip Python/lib init (~10× faster host startup)
- GPU Snapshots: skip CUDA graph capture (4–10× faster device startup)
Result: ~50 s
Each layer depends on the one below. GPU snapshots are delivered via ImageFS. CPU snapshots make GPU snapshots tractable.
Takeaways
- Cold start is the enemy of serverless inference. Long cold start → big idle buffer → poor utilization → high cost.
- The bottleneck moves as you optimize. Fixing VM spin-up exposes the filesystem, then Python init, then GPU setup.
- Snapshotting is about skipping recomputation, not caching data. The value is storing CUDA graphs that have no other serialization path.
Source: Modal — Truly Serverless GPUs.