Serverless GPUs & Cold Starts

How a 33-minute cold start gets reduced to ~50 seconds.

Inference traffic is spiky. You don't control when users send requests. There are two ways to handle that: over-provision for peak load (wasteful) or auto-scale fast enough to follow demand (hard, and the whole problem). Only the hard path scales economically.

Taking that path requires cold starts, the time from "spin up a replica" to "ready to serve", to be short enough that the platform can react to spikes in seconds.

This page walks through how a real platform (Modal's public writeup) takes a naive ~2000 s cold start down to ~50 s by stacking four layers, each compounding on the last.

The metric

GPU Allocation Utilization = useful GPU-seconds / paid GPU-seconds

Most orgs run at 10–20% utilization. The fix isn't buying fewer GPUs; it's spinning new ones up fast enough that you don't need a giant idle buffer as insurance.
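As a quick worked example of the metric, here is a minimal sketch. The fleet size and workload are made-up numbers chosen to land inside the 10–20% range, not measurements.

```python
def gpu_allocation_utilization(useful_gpu_seconds: float, paid_gpu_seconds: float) -> float:
    """Fraction of paid GPU time that did useful work."""
    if paid_gpu_seconds <= 0:
        raise ValueError("paid_gpu_seconds must be positive")
    return useful_gpu_seconds / paid_gpu_seconds


# Hypothetical fleet: 10 GPUs paid for one hour, 1.5 GPU-hours of real work.
paid = 10 * 3600
useful = 1.5 * 3600
utilization = gpu_allocation_utilization(useful, paid)
print(f"{utilization:.0%}")  # prints 15%
```

The other 85% of the bill in this example is idle capacity held as insurance against spikes.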

The four layers

1. Cloud buffer — removes tens of minutes

A pool of idle, health-checked GPUs is kept ready at all times. Booting a VM and running health checks takes 10–30 min; doing that ahead of time and keeping the buffer warm moves the cost off the hot path. The buffer means some GPUs are always idle, which is a deliberate cost: a fleet running at 100% utilization has no headroom to absorb a spike.
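The buffer idea can be sketched as a toy warm pool. Everything here is hypothetical (the `WarmPool` class, host names, the 1200 s boot time) and nothing is actually booted; the point is only that a request pays the boot cost when the buffer is empty and pays nothing otherwise.

```python
from collections import deque


class WarmPool:
    """Toy buffer of pre-booted, health-checked GPU hosts (all names hypothetical)."""

    def __init__(self, target_size: int, boot_seconds: float):
        self.target_size = target_size
        self.boot_seconds = boot_seconds      # slow path: VM boot + health checks
        self.ready = deque(f"gpu-{i}" for i in range(target_size))
        self.booted = target_size

    def acquire(self) -> tuple[str, float]:
        """Return (host, seconds the caller waited for a host)."""
        if self.ready:
            return self.ready.popleft(), 0.0  # hot path: host is already warm
        return self._boot(), self.boot_seconds  # buffer empty: pay the full boot

    def refill(self) -> None:
        """Background work: top the buffer back up, off the request path."""
        while len(self.ready) < self.target_size:
            self.ready.append(self._boot())

    def _boot(self) -> str:
        host = f"gpu-{self.booted}"
        self.booted += 1
        return host


pool = WarmPool(target_size=2, boot_seconds=1200.0)
waits = [pool.acquire()[1] for _ in range(3)]  # third request finds the buffer empty
pool.refill()
print(waits)  # [0.0, 0.0, 1200.0]
```

Sizing `target_size` is the trade-off: a bigger buffer absorbs bigger spikes but means more idle GPUs.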

2. Custom filesystem (ImageFS) — saves ~1 minute

Container images are huge (PyTorch + CUDA + your code = GBs), and pulling a whole image before the container can start is slow. A lazy-loading, content-addressed, tiered filesystem avoids the full pull:

  • Lazy load — only the index loads upfront. Files fetched on demand or never.
  • Content addressing — files stored by hash. Every container's torch/ is the same bytes, cached once.
  • Tiered cache — host RAM → local SSD → object storage.

After the first container on a host, subsequent ones skip the network for shared deps.
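The three properties above can be sketched together in a few lines. This is a toy model, not ImageFS itself: `TieredContentStore`, `LazyImage`, and the file contents are invented for illustration, and plain dicts stand in for RAM, SSD, and object storage.

```python
import hashlib


def digest_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class TieredContentStore:
    """Toy content-addressed store: host RAM -> local SSD -> object storage."""

    def __init__(self, object_storage: dict[str, bytes]):
        self.ram: dict[str, bytes] = {}
        self.ssd: dict[str, bytes] = {}
        self.object_storage = object_storage
        self.network_fetches = 0

    def read(self, digest: str) -> bytes:
        if digest in self.ram:
            return self.ram[digest]
        if digest in self.ssd:
            data = self.ssd[digest]
        else:
            data = self.object_storage[digest]  # slowest tier: goes over the network
            self.network_fetches += 1
            self.ssd[digest] = data
        self.ram[digest] = data                 # promote to the fastest tier
        return data


class LazyImage:
    """Only the index (path -> digest) loads upfront; bytes are fetched on demand."""

    def __init__(self, index: dict[str, str], store: TieredContentStore):
        self.index = index
        self.store = store

    def open(self, path: str) -> bytes:
        return self.store.read(self.index[path])


torch_bytes = b"pretend this is torch/__init__.py"
app_a, app_b, unused = b"print('service A')", b"print('service B')", b"never read"
remote = {digest_of(blob): blob for blob in (torch_bytes, app_a, app_b, unused)}
store = TieredContentStore(remote)              # one store shared per host

image_a = LazyImage({"torch/__init__.py": digest_of(torch_bytes),
                     "app.py": digest_of(app_a),
                     "docs/unused.txt": digest_of(unused)}, store)
image_b = LazyImage({"torch/__init__.py": digest_of(torch_bytes),
                     "app.py": digest_of(app_b)}, store)

image_a.open("torch/__init__.py")
image_a.open("app.py")
fetches_after_a = store.network_fetches         # 2: unused.txt was never fetched
image_b.open("torch/__init__.py")               # same hash, served from host cache
image_b.open("app.py")
fetches_after_b = store.network_fetches         # 3: only the new app code was fetched
```

The second container pays the network only for bytes the host has not seen before, which is the behaviour the paragraph above describes.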

3. CPU memory snapshots — ~10× speedup on host startup

import torch triggers thousands of lines of Python and tens of thousands of OS calls. Do that work once, freeze process memory to disk, restore directly into the post-init state on every cold start.

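Split startup work into what gets snapshotted and what must re-run on every cold start (network connections and other state that does not survive a restore):

import modal

app = modal.App("inference-service")  # app name is illustrative

@app.cls(enable_memory_snapshot=True)
class InferenceService:
    @modal.enter(snap=True)   # runs once; memory is snapshotted after it returns
    def startup(self):
        import torch  # expensive import, captured in the snapshot
        self.model = load_model()  # placeholder for your own model loader

    @modal.enter(snap=False)  # re-runs on every cold start, after the restore
    def finalize(self):
        self.connect_to_db()  # placeholder: reopen connections that can't be snapshotted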

Snapshotting is easier when the container runtime is gVisor: the container's entire kernel state lives in one Go program, so serializing it is straightforward. Snapshots are specific to a CPU architecture, so a heterogeneous fleet needs one per ISA.

4. GPU memory snapshots — 4–10× speedup on device startup

The remaining cost is inference engine setup: CUDA graph capture, Torch compilation, KV cache allocation. Tens of seconds to minutes per fresh GPU.

CUDA graphs are recorded sequences of GPU operations that are replayed as one unit. Capturing one requires a dry run of the model; the captured graph references GPU memory through raw device pointers, and there is no native way to serialize it to disk.

The trick: don't serialize objects, save the entire GPU memory space byte-for-byte.

Snapshot:  driver copies VRAM → CPU RAM → ImageFS
Restore:   load → CPU RAM → driver pushes back to VRAM

Everything lands at the same address, so all internal pointers stay valid automatically. Weights are huge but cheap to reload — offload them before snapshotting, snapshot only the graphs, reload weights after. Smaller snapshot, same wins.
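Why the same-address restore matters can be shown with a pure-Python analogy. This is not how the driver works internally: `FakeDevice`, the base address, and the stored value are all invented. A byte array stands in for VRAM and one 8-byte slot plays the role of a raw pointer inside a captured graph.

```python
import struct


class FakeDevice:
    """Toy 'VRAM': a flat byte array addressed from a fixed base address."""

    def __init__(self, base: int, size: int):
        self.base = base
        self.mem = bytearray(size)

    def write_u64(self, addr: int, value: int) -> None:
        struct.pack_into("<Q", self.mem, addr - self.base, value)

    def read_u64(self, addr: int) -> int:
        return struct.unpack_from("<Q", self.mem, addr - self.base)[0]


BASE = 0x7000_0000
dev = FakeDevice(BASE, 64)
dev.write_u64(BASE, BASE + 32)       # "graph node" holding a raw pointer to a buffer
dev.write_u64(BASE + 32, 1234)       # the buffer the pointer refers to

snapshot = bytes(dev.mem)            # byte-for-byte copy, no object serialization

# Restore at the same base address: the stored pointer is still meaningful.
restored = FakeDevice(BASE, 64)
restored.mem[:] = snapshot
value = restored.read_u64(restored.read_u64(BASE))   # follows the pointer -> 1234

# Restore at a different base: the bytes are identical but the pointer is stale.
moved = FakeDevice(BASE + 0x1000, 64)
moved.mem[:] = snapshot
stale_ptr = moved.read_u64(moved.base)
pointer_still_valid = moved.base <= stale_ptr < moved.base + len(moved.mem)  # False
```

Copying bytes never required understanding what the pointer meant, which is why the approach sidesteps the lack of a serialization format.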

Today this works cleanly only for single-GPU setups. Multi-GPU setups that use NCCL deadlock when the checkpoint pauses them.

How the layers stack

Baseline:                                      ~2000 s

- Cloud Buffer:   off the hot path             tens of minutes
- ImageFS:        lazy + content-addressed     ~1 minute
- CPU Snapshots:  skip Python/lib init         ~10× host startup
- GPU Snapshots:  skip CUDA graph capture      4–10× device startup

Result:                                        ~50 s

Each layer depends on the one below. GPU snapshots are delivered via ImageFS. CPU snapshots make GPU snapshots tractable.
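To see how the savings can add up to the two totals, here is a small arithmetic sketch. The source gives only the ~2000 s and ~50 s endpoints and the per-layer savings; the per-stage split below is invented so that the numbers are consistent with those figures, and should not be read as measured data.

```python
# Hypothetical breakdown of the naive cold start, in seconds (sums to 2000).
baseline = {
    "vm_boot": 1620.0,      # VM boot + health checks (27 min, inside the 10-30 min range)
    "image_load": 70.0,     # pulling the container image
    "host_init": 100.0,     # Python and library initialization
    "device_init": 210.0,   # CUDA graph capture, compilation, KV cache allocation
}

optimized = {
    "vm_boot": 0.0,                                # cloud buffer: off the hot path
    "image_load": baseline["image_load"] - 60.0,   # ImageFS: saves ~1 minute
    "host_init": baseline["host_init"] / 10,       # CPU snapshot: ~10x
    "device_init": baseline["device_init"] / 7,    # GPU snapshot: assumed 7x, within 4-10x
}

baseline_total = sum(baseline.values())
optimized_total = sum(optimized.values())
speedup = baseline_total / optimized_total

print(f"{baseline_total:.0f} s -> {optimized_total:.0f} s ({speedup:.0f}x)")
# prints: 2000 s -> 50 s (40x)
```

Whatever the real split is, the end-to-end ratio from the stated totals is 2000 / 50 = 40×, and most of the absolute time comes out of the first layer.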

Takeaways

  • Cold start is the enemy of serverless inference. Long cold start → big idle buffer → poor utilization → high cost.
  • The bottleneck moves as you optimize. Fix VM spin-up and the filesystem becomes the bottleneck, then Python init, then GPU setup.
  • Snapshotting is about skipping recomputation, not caching data. The value is storing CUDA graphs that have no other serialization path.

Source: Modal — Truly Serverless GPUs.
