A vLLM OOM crash is one of the more expensive five-line stack traces in production ML. The model loaded, the first few requests worked, then a long prompt or a burst of concurrency pushed the KV cache past the reserved block and the serving process terminated. Your users see 500s. Your pager fires at 3 a.m.
Almost every vLLM OOM has one of four causes, and the fixes come in a predictable order. Work through the eight steps below in sequence. Most services recover by step three. If they do not, the problem is not configuration. The model genuinely does not fit in VRAM at the precision you are running, and the answer is memory tiering.
The four root causes of vLLM OOM
Before the checklist, the diagnosis frame. You have to know which bucket you are in before you start changing knobs.
- Configuration overshoot. gpu_memory_utilization is set too aggressively for the host. The reserved block collides with the driver, the display server, or another tenant.
- KV cache blowout. max_model_len is set to the model’s theoretical maximum (e.g. 128k for Llama 3) while your real traffic peaks at 8k, so you are reserving (and eventually thrashing) KV for context you never use.
- Concurrency misjudgement. max_num_seqs lets 256 sequences run concurrently on a GPU that can physically host 8. The first burst that approaches that limit drops the server.
- The model genuinely does not fit. At the precision you chose, the weights plus the live KV cache plus activation buffers plus CUDA graphs exceed the card. No amount of knob-turning fixes physics.
The first three are configuration. Steps 1 through 7 below address them. Step 8 is the cure for cause 4.
The eight-step checklist
Step 1. Measure the actual VRAM budget
Before anything else, get the truth about how much VRAM you have, not how much the datasheet claims.
# Before launching vLLM, on the target host
nvidia-smi --query-gpu=memory.total,memory.used,memory.free \
--format=csv,noheader,nounits
Note the memory.free figure after your OS, display server (if any), and any sibling CUDA processes have loaded. This is your budget. On a 24 GB RTX 4090 with a monitor attached and a desktop environment, expect around 22 GB free. On a headless A100 80 GB in a datacentre, expect close to 80 GB. Write the number down.
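If you prefer to script the check, the same query parses cleanly. A minimal sketch, assuming the CSV field order from the command above; the 2 GiB safety margin is this article's assumption, not a vLLM default:

```python
import subprocess

def parse_smi_line(line: str) -> dict:
    """Parse one 'total, used, free' row from nvidia-smi's
    csv,noheader,nounits output (all values are MiB)."""
    total, used, free = (int(v.strip()) for v in line.split(","))
    return {"total_mib": total, "used_mib": used, "free_mib": free}

def vram_budget_mib(line: str, safety_margin_mib: int = 2048) -> int:
    """Free VRAM minus a safety margin: the figure to size vLLM against.
    The 2 GiB margin is a conservative assumption, not a vLLM default."""
    return parse_smi_line(line)["free_mib"] - safety_margin_mib

def current_budgets_mib() -> list:
    """Query every GPU on the host (requires nvidia-smi on PATH)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.total,memory.used,memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [vram_budget_mib(row) for row in out.strip().splitlines()]
```

The safety margin covers allocator fragmentation and late-arriving sibling processes; tune it to your host.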
Step 2. Lower gpu_memory_utilization
vLLM pre-allocates a contiguous memory block at startup sized by gpu_memory_utilization. The default of 0.90 is aggressive for workstation cards and for shared hosts.
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,  # start here on 24 GB cards
)
Set to 0.80 on first pass for any host that runs anything else on the GPU (display server, Docker GPU monitoring sidecar, telemetry exporter). Tune up only after you have proved headroom with a peak-load test (step 7).
Step 3. Cap max_model_len to real traffic
This is the step that rescues the most services. The default max_model_len is the model’s maximum position (8k, 32k, 128k) and vLLM sizes its KV pool to that ceiling. A Llama-3-70B in fp16 at 32k context with eight sequences costs about 80 GB of KV cache. You almost certainly do not need 32k. Look at your logs for the 99th-percentile prompt-plus-response length, round up, and pin it.
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=8192,  # cover your real P99, not the model's theoretical max
)
On a 24 GB card, moving from 32k to 8k will often be the difference between OOM and comfort.
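The 80 GB figure quoted above is plain arithmetic. A sketch of the estimate, using the published Llama-3-70B shape (80 layers, 8 KV heads, head dimension 128) and ignoring block-allocator rounding, which adds a little on top:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Two tensors per layer (K and V), one head_dim vector per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, num_seqs: int,
                 dtype_bytes: int = 2) -> float:
    per_tok = kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
    return per_tok * context_len * num_seqs / 2**30

# Llama-3-70B in fp16: 320 KiB of KV per token.
# At 32k context x 8 sequences that is 80 GiB; at 8k it drops to 20 GiB.
```

Run the same numbers for your own model's config before pinning max_model_len.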
Step 4. Cap max_num_seqs
max_num_seqs is the maximum number of sequences vLLM can hold in flight at once. Each sequence consumes its own KV block. The default of 256 is optimistic for most deployments.
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=8192,
    max_num_seqs=16,  # match your real peak concurrency
)
Right-size this to the concurrency your service actually sees, including planned bursts. If you run behind a load balancer with per-instance concurrency control, max_num_seqs should match that concurrency cap.
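Inverting the KV arithmetic gives a first-pass upper bound on max_num_seqs: how many full-length sequences the KV pool can physically hold. A sketch, assuming you read the KV pool size from vLLM's startup log, and using the Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128) in the worked example:

```python
def max_seqs_that_fit(kv_pool_gib: float, context_len: int,
                      layers: int, kv_heads: int, head_dim: int,
                      dtype_bytes: int = 2) -> int:
    """Worst case: every sequence uses the full context window."""
    per_seq = 2 * layers * kv_heads * head_dim * dtype_bytes * context_len
    return int(kv_pool_gib * 2**30 // per_seq)

# Llama-3-8B in fp16 at 8k context costs exactly 1 GiB of KV per
# sequence, so a 16 GiB KV pool holds 16 full-length sequences.
```

Set max_num_seqs at or below this bound; anything higher just invites preemption and swap thrash under burst.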
Step 5. Quantise weights if the workload tolerates it
Quantisation cuts weight memory, not KV memory, but for weight-heavy OOMs that is enough. The rule of thumb for Llama-3-70B:
| Precision | Weight size | Fits on 24 GB? |
|---|---|---|
| fp16 | ~140 GB | No |
| int8 | ~70 GB | No |
| int4 (AWQ / GPTQ) | ~38 GB | Not in pure VRAM |
| Q4_K_S (GGUF) | ~37 GB | Partial, needs tiering |
A Llama-3-8B in int4 is about 5 GB of weights, plenty of headroom on a 24 GB card for KV cache even at 32k. A Llama-3-70B remains a two-tier problem even at int4.
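The table's figures follow from parameter count times bytes per parameter, plus quantisation overhead (scales, zero-points), which is why the int4 row reads ~38 GB rather than a clean 35. A sketch of the estimate; the 10 per cent overhead factor is a rough assumption, not a property of AWQ or GPTQ:

```python
def weight_gb(params: float, bits_per_param: float,
              overhead: float = 0.0) -> float:
    """Approximate weight memory in decimal GB.
    overhead covers quantisation metadata (scales, zero-points)."""
    return params * bits_per_param / 8 * (1 + overhead) / 1e9

# Llama-3-70B:
#   fp16                    -> 140 GB
#   int8                    -> 70 GB
#   int4 with ~10% overhead -> ~38.5 GB
```

Repeat the arithmetic at your candidate precision before reaching for tiering: if weights plus the step-3 KV estimate clear your step-1 budget, you stay in pure vLLM.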
Step 6. Use prefix caching only when it pays
enable_prefix_caching shares KV blocks across requests that start with the same prefix. For RAG with a stable system prompt, or agent loops with a persistent persona, this is a massive hit-rate win. For workloads with no prefix overlap, the cache reserves VRAM that the live KV could have used.
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    enable_prefix_caching=True,  # only if your prompts actually share prefixes
)
The rule: measure first. vLLM exposes prefix_cache_hit_rate in its metrics. If it is below 30 per cent, turn prefix caching off.
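Measuring is a small script against vLLM's Prometheus endpoint. A sketch; the /metrics path and the exact metric name vary by vLLM version, so this filters on the "prefix_cache" substring rather than hard-coding a name:

```python
import urllib.request

def prefix_cache_metrics(text: str) -> dict:
    """Pull every prefix-cache-related sample out of a
    Prometheus text-format payload."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or "prefix_cache" not in line:
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

def fetch(url: str = "http://localhost:8000/metrics") -> dict:
    # URL assumes the default vLLM OpenAI-compatible server port.
    with urllib.request.urlopen(url) as resp:
        return prefix_cache_metrics(resp.read().decode())
```

Sample it under real traffic, not a synthetic burst of identical prompts, or the hit rate will flatter you.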
Step 7. Run a synthetic peak-load profile
Before you declare the crash fixed, you have to prove it. Build a synthetic load generator that hits your service with peak prompt length and peak concurrency for five minutes. Watch nvidia-smi in a second terminal. Watch vLLM’s /metrics endpoint for gpu_cache_usage_perc. If you cannot run this test, you do not have a production deployment. You have an unreleased beta.
# Example with vegeta (or any HTTP load tool)
echo "POST http://localhost:8000/v1/completions" \
| vegeta attack -duration=5m -rate=20 -body=peak-prompt.json \
| vegeta report
If OOM occurs under synthetic peak, tighten steps 3 and 4 further, or proceed to step 8.
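Rather than eyeballing nvidia-smi in a second terminal, a small sidecar can record the peak for you. A minimal sketch, polling at 1 Hz for the length of the load test; nvidia-smi must be on PATH:

```python
import subprocess
import time

def parse_used_mib(smi_output: str) -> list:
    """One memory.used value (MiB) per GPU, from csv,noheader,nounits."""
    return [int(v) for v in smi_output.split()]

def used_mib() -> list:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_used_mib(out)

def monitor_peak(duration_s: int = 300, interval_s: float = 1.0) -> int:
    """Poll for the duration of the load test; return peak MiB seen."""
    peak, deadline = 0, time.time() + duration_s
    while time.time() < deadline:
        peak = max(peak, max(used_mib()))
        time.sleep(interval_s)
    return peak
```

If the reported peak sits within a GiB or two of your step-1 budget, you have no burst headroom; tighten steps 3 and 4 before calling it fixed.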
Step 8. Tier memory when the model does not fit
You have lowered gpu_memory_utilization, capped max_model_len, limited max_num_seqs, quantised as far as the workload tolerates, and the model still does not fit. At this point the problem is not vLLM. It is the fact that a 70B-class model at reasonable precision, with production-scale KV cache, exceeds 24 GB of VRAM.
The solution is memory tiering: deciding which model pages and which KV-cache blocks live in VRAM (fast, tiny), which live in host RAM (medium, larger), and which live on NVMe (slow, enormous), then moving pages between tiers based on access patterns.
Pure vLLM does not do this. It is a VRAM-only engine, and that is the right design trade for datacentre-class hardware. For 24 GB cards with 70B-class models, you need a layer on top that orchestrates across tiers.
Two ways to get there:
- llama.cpp supports explicit GPU-layer splits. You load N of the model’s 80 layers onto the GPU, the rest on CPU. It works. The trade is latency: layers executed on CPU are an order of magnitude or more slower, and the split is static.
- Sector88 Runtime sits as a tier-aware orchestrator on top of vLLM and llama.cpp backends. It detects available VRAM at startup, places hot layers and active KV blocks in VRAM, promotes and demotes pages on the fly, and exposes an OpenAI-compatible API so nothing upstream has to change. Our published RTX 4090 benchmark shows Llama-3-70B stable at 22.93 GB VRAM peak with 43 of 80 layers on GPU, the rest tiered into RAM and NVMe.
Either path removes the “it does not fit” OOM class entirely. Which one you pick depends on how much custom infrastructure you want to own.
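If you take the llama.cpp route, the split is a single flag. A hedged sketch of a server launch; the model path, port, and 43-layer split are illustrative placeholders to tune against your own step-1 budget, and --n-gpu-layers is the flag that controls how many layers land on the GPU:

```shell
# Load 43 of the model's layers onto the GPU, run the rest on CPU.
# Model path and split are illustrative; size the split so peak VRAM
# stays inside the budget you measured in step 1.
./llama-server \
  -m ./models/llama-3-70b.Q4_K_S.gguf \
  --n-gpu-layers 43 \
  --ctx-size 8192 \
  --host 0.0.0.0 --port 8080
```

llama-server also exposes an OpenAI-compatible endpoint, so clients written against vLLM's API mostly carry over.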
A quick reference config for the three cases
24 GB GPU, Llama-3-8B, production chat
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=8192,
    max_num_seqs=32,
    enable_prefix_caching=False,  # flip to True only if metrics justify
)
40 GB A100, Llama-3-70B AWQ, agent workload
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.88,
    max_model_len=8192,
    max_num_seqs=16,
    enable_prefix_caching=True,  # agents share system prompts; hit rate > 60%
)
24 GB GPU, Llama-3-70B, sovereign deployment
Pure vLLM will not fit this case. Run a tiered runtime (llama.cpp with GPU-layer split, or Sector88 Runtime on top) and accept the latency trade that the tier crossing imposes. The stability win (no OOM, no tail-latency spikes from CUDA allocator thrash) is almost always worth it for on-prem deployments where you cannot just “add more GPUs”.
What not to do
- Do not set gpu_memory_utilization to 1.0. You are asking the OOM killer to do your capacity planning.
- Do not run vLLM and another CUDA workload on the same card with default settings. Either isolate the GPU or carefully budget for both.
- Do not keep max_model_len at the model’s maximum “because we might need it later”. You will pay for that context every single inference, today, in KV cache memory.
- Do not chase performance without stability first. An engine that serves at 10 tokens/s for five minutes and then crashes is an engine that serves at 0 tokens/s. Fix OOM first, then tune throughput.
- Do not blame vLLM. vLLM is a great engine for the job it is designed for. It is a single-tier allocator optimised for datacentre cards. When you push it past that envelope, reach for a tier-aware layer instead of fighting the tool.
Summary
The path from “vLLM keeps crashing” to “vLLM serves reliably” is usually three small configuration changes. Cap gpu_memory_utilization. Cap max_model_len to your real traffic. Cap max_num_seqs to your real concurrency. If those three fixes do not cover you, go further down the checklist: quantise, profile, and ultimately tier.
The rule of thumb: if the model fits in VRAM at the precision you want, stay in pure vLLM and tune it. If the model does not fit, stop trying to force it. Put a memory-tiering layer between your application and the engine, and let the tiers do what they were designed to do.