A vLLM OOM crash is one of the more expensive five-line stack traces in production ML. The model loaded, the first few requests worked, then a long prompt or a burst of concurrency pushed the KV cache past the reserved block and the serving process terminated. Your users see 500s. Your pager fires at 3 a.m.
Almost every vLLM OOM has one of four causes, and the fixes come in a predictable order. Work through the eight steps below in sequence. Most services recover by step three. If they do not, the problem is not configuration. The model genuinely does not fit in VRAM at the precision you are running, and the answer is memory tiering.
The four root causes of vLLM OOM
Before the checklist, the diagnosis frame. You have to know which bucket you are in before you start changing knobs.
- Configuration overshoot. gpu_memory_utilization is set too aggressively for the host. The reserved block collides with the driver, the display server, or another tenant.
- KV cache blowout. max_model_len is set to the model’s theoretical maximum (e.g. 128k for Llama 3) while your real traffic peaks at 8k, so you are reserving (and eventually thrashing) KV for context you never use.
- Concurrency misjudgement. max_num_seqs lets 256 sequences run concurrently on a GPU that can physically host 8. The first burst that approaches that limit drops the server.
- The model genuinely does not fit. At the precision you chose, the weights plus the live KV cache plus activation buffers plus CUDA graphs exceed the card. No amount of knob-turning fixes physics.
The first three are configuration. Steps 1 through 7 below address them. Step 8 is the cure for cause 4.
The eight-step checklist
Step 1. Measure the actual VRAM budget
Before anything else, get the truth about how much VRAM you have, not how much the datasheet claims.
# Before launching vLLM, on the target host
nvidia-smi --query-gpu=memory.total,memory.used,memory.free \
--format=csv,noheader,nounits
Note the memory.free figure after your OS, display server (if any), and any sibling CUDA processes have loaded. This is your budget. On a 24 GB RTX 4090 with a monitor attached and a desktop environment, expect around 22 GB free. On a headless A100 80 GB in a datacentre, expect close to 80 GB. Write the number down.
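If you prefer to script the check, the same query parses cleanly. A minimal sketch, assuming the CSV field order from the command above; the 2 GiB safety margin is this article's assumption, not a vLLM default:

```python
import subprocess

def parse_smi_line(line: str) -> dict:
    """Parse one 'total, used, free' row from nvidia-smi's
    csv,noheader,nounits output (all values are MiB)."""
    total, used, free = (int(v.strip()) for v in line.split(","))
    return {"total_mib": total, "used_mib": used, "free_mib": free}

def vram_budget_mib(line: str, safety_margin_mib: int = 2048) -> int:
    """Free VRAM minus a safety margin: the figure to size vLLM against.
    The 2 GiB margin is a conservative assumption, not a vLLM default."""
    return parse_smi_line(line)["free_mib"] - safety_margin_mib

def current_budgets_mib() -> list:
    """Query every GPU on the host (requires nvidia-smi on PATH)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.total,memory.used,memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [vram_budget_mib(row) for row in out.strip().splitlines()]
```

The safety margin covers allocator fragmentation and late-arriving sibling processes; tune it to your host.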
Step 2. Lower gpu_memory_utilization
vLLM pre-allocates a contiguous memory block at startup sized by gpu_memory_utilization. The default of 0.90 is aggressive for workstation cards and for shared hosts.
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,  # start here on 24 GB cards
)
Set to 0.80 on first pass for any host that runs anything else on the GPU (display server, Docker GPU monitoring sidecar, telemetry exporter). Tune up only after you have proved headroom with a peak-load test (step 7).
Step 3. Cap max_model_len to real traffic
This is the step that rescues the most services. The default max_model_len is the model’s maximum position (8k, 32k, 128k) and vLLM sizes its KV pool to that ceiling. A Llama-3-70B in fp16 at 32k context with eight sequences costs about 80 GB of KV cache. You almost certainly do not need 32k. Look at your logs for the 99th-percentile prompt-plus-response length, round up, and pin it.
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=8192,  # cover your real P99, not the model's theoretical max
)
On a 24 GB card, moving from 32k to 8k will often be the difference between OOM and comfort.
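The 80 GB figure quoted above is plain arithmetic. A sketch of the estimate, using the published Llama-3-70B shape (80 layers, 8 KV heads, head dimension 128) and ignoring block-allocator rounding, which adds a little on top:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Two tensors per layer (K and V), one head_dim vector per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, num_seqs: int,
                 dtype_bytes: int = 2) -> float:
    per_tok = kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
    return per_tok * context_len * num_seqs / 2**30

# Llama-3-70B in fp16: 320 KiB of KV per token.
# At 32k context x 8 sequences that is 80 GiB; at 8k it drops to 20 GiB.
```

Run the same numbers for your own model's config before pinning max_model_len.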
Step 4. Cap max_num_seqs
max_num_seqs is the maximum number of sequences vLLM can hold in flight at once. Each sequence consumes its own KV block. The default of 256 is optimistic for most deployments.
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=8192,
    max_num_seqs=16,  # match your real peak concurrency
)
Right-size this to the concurrency your service actually sees, including planned bursts. If you run behind a load balancer with per-instance concurrency control, max_num_seqs should match that concurrency cap.
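Inverting the KV arithmetic gives a first-pass upper bound on max_num_seqs: how many full-length sequences the KV pool can physically hold. A sketch, assuming you read the KV pool size from vLLM's startup log, and using the Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128) in the worked example:

```python
def max_seqs_that_fit(kv_pool_gib: float, context_len: int,
                      layers: int, kv_heads: int, head_dim: int,
                      dtype_bytes: int = 2) -> int:
    """Worst case: every sequence uses the full context window."""
    per_seq = 2 * layers * kv_heads * head_dim * dtype_bytes * context_len
    return int(kv_pool_gib * 2**30 // per_seq)

# Llama-3-8B in fp16 at 8k context costs exactly 1 GiB of KV per
# sequence, so a 16 GiB KV pool holds 16 full-length sequences.
```

Set max_num_seqs at or below this bound; anything higher just invites preemption and swap thrash under burst.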
Step 5. Quantise weights if the workload tolerates it
Quantisation cuts weight memory, not KV memory, but for weight-heavy OOMs that is enough. The rule of thumb for Llama-3-70B:
| Precision | Weight size | Fits on 24 GB? |
|---|---|---|
| fp16 | ~140 GB | No |
| int8 | ~70 GB | No |
| int4 (AWQ / GPTQ) | ~38 GB | Not in pure VRAM |
| Q4_K_S (GGUF) | ~37 GB | Partial, needs tiering |
A Llama-3-8B in int4 is about 5 GB of weights, plenty of headroom on a 24 GB card for KV cache even at 32k. A Llama-3-70B remains a two-tier problem even at int4.
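The table's figures follow from parameter count times bytes per parameter, plus quantisation overhead (scales, zero-points), which is why the int4 row reads ~38 GB rather than a clean 35. A sketch of the estimate; the 10 per cent overhead factor is a rough assumption, not a property of AWQ or GPTQ:

```python
def weight_gb(params: float, bits_per_param: float,
              overhead: float = 0.0) -> float:
    """Approximate weight memory in decimal GB.
    overhead covers quantisation metadata (scales, zero-points)."""
    return params * bits_per_param / 8 * (1 + overhead) / 1e9

# Llama-3-70B:
#   fp16                    -> 140 GB
#   int8                    -> 70 GB
#   int4 with ~10% overhead -> ~38.5 GB
```

Repeat the arithmetic at your candidate precision before reaching for tiering: if weights plus the step-3 KV estimate clear your step-1 budget, you stay in pure vLLM.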
Step 6. Use prefix caching only when it pays
enable_prefix_caching shares KV blocks across requests that start with the same prefix. For RAG with a stable system prompt, or agent loops with a persistent persona, this is a massive hit-rate win. For workloads with no prefix overlap, the cache reserves VRAM that the live KV could have used.
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    enable_prefix_caching=True,  # only if your prompts actually share prefixes
)
The rule: measure first. vLLM exposes prefix_cache_hit_rate in its metrics. If it is below 30 per cent, turn prefix caching off.
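Measuring is a small script against vLLM's Prometheus endpoint. A sketch; the /metrics path and the exact metric name vary by vLLM version, so this filters on the "prefix_cache" substring rather than hard-coding a name:

```python
import urllib.request

def prefix_cache_metrics(text: str) -> dict:
    """Pull every prefix-cache-related sample out of a
    Prometheus text-format payload."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or "prefix_cache" not in line:
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

def fetch(url: str = "http://localhost:8000/metrics") -> dict:
    # URL assumes the default vLLM OpenAI-compatible server port.
    with urllib.request.urlopen(url) as resp:
        return prefix_cache_metrics(resp.read().decode())
```

Sample it under real traffic, not a synthetic burst of identical prompts, or the hit rate will flatter you.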
Step 7. Run a synthetic peak-load profile
Before you declare the crash fixed, you have to prove it. Build a synthetic load generator that hits your service with peak prompt length and peak concurrency for five minutes. Watch nvidia-smi in a second terminal. Watch vLLM’s /metrics endpoint for gpu_cache_usage_perc. If you cannot run this test, you do not have a production deployment. You have an unreleased beta.
# Example with vegeta (or any HTTP load tool)
echo "POST http://localhost:8000/v1/completions" \
| vegeta attack -duration=5m -rate=20 -body=peak-prompt.json \
| vegeta report
If OOM occurs under synthetic peak, tighten steps 3 and 4 further, or proceed to step 8.
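Rather than eyeballing nvidia-smi in a second terminal, a small sidecar can record the peak for you. A minimal sketch, polling at 1 Hz for the length of the load test; nvidia-smi must be on PATH:

```python
import subprocess
import time

def parse_used_mib(smi_output: str) -> list:
    """One memory.used value (MiB) per GPU, from csv,noheader,nounits."""
    return [int(v) for v in smi_output.split()]

def used_mib() -> list:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_used_mib(out)

def monitor_peak(duration_s: int = 300, interval_s: float = 1.0) -> int:
    """Poll for the duration of the load test; return peak MiB seen."""
    peak, deadline = 0, time.time() + duration_s
    while time.time() < deadline:
        peak = max(peak, max(used_mib()))
        time.sleep(interval_s)
    return peak
```

If the reported peak sits within a GiB or two of your step-1 budget, you have no burst headroom; tighten steps 3 and 4 before calling it fixed.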
Step 8. Tier memory when the model does not fit
You have lowered gpu_memory_utilization, capped max_model_len, limited max_num_seqs, quantised as far as the workload tolerates, and the model still does not fit. At this point the problem is not vLLM. It is the fact that a 70B-class model at reasonable precision, with production-scale KV cache, exceeds 24 GB of VRAM.
The solution is memory tiering: deciding which model pages and which KV-cache blocks live in VRAM (fast, tiny), which live in host RAM (medium, larger), and which live on NVMe (slow, enormous), then moving pages between tiers based on access patterns.
Pure vLLM does not do this. It is a VRAM-only engine, and that is the right design trade for datacentre-class hardware. For 24 GB cards with 70B-class models, you need a layer on top that orchestrates across tiers.
Two ways to get there:
- llama.cpp supports explicit GPU-layer splits. You load N of the model’s 80 layers onto the GPU, the rest on CPU. It works. The trade is latency: layers executed on CPU are an order of magnitude or more slower, and the split is static.
- Sector88 Runtime sits as a tier-aware orchestrator on top of vLLM and llama.cpp backends. It detects available VRAM at startup, places hot layers and active KV blocks in VRAM, promotes and demotes pages on the fly, and exposes an OpenAI-compatible API so nothing upstream has to change. Our published RTX 4090 benchmark shows Llama-3-70B stable at 22.93 GB VRAM peak with 43 of 80 layers on GPU, the rest tiered into RAM and NVMe.
Either path removes the “it does not fit” OOM class entirely. Which one you pick depends on how much custom infrastructure you want to own.
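If you take the llama.cpp route, the split is a single flag. A hedged sketch of a server launch; the model path, port, and 43-layer split are illustrative placeholders to tune against your own step-1 budget, and --n-gpu-layers is the flag that controls how many layers land on the GPU:

```shell
# Load 43 of the model's layers onto the GPU, run the rest on CPU.
# Model path and split are illustrative; size the split so peak VRAM
# stays inside the budget you measured in step 1.
./llama-server \
  -m ./models/llama-3-70b.Q4_K_S.gguf \
  --n-gpu-layers 43 \
  --ctx-size 8192 \
  --host 0.0.0.0 --port 8080
```

llama-server also exposes an OpenAI-compatible endpoint, so clients written against vLLM's API mostly carry over.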
A quick reference config for the three cases
24 GB GPU, Llama-3-8B, production chat
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=8192,
    max_num_seqs=32,
    enable_prefix_caching=False,  # flip to True only if metrics justify
)
40 GB A100, Llama-3-70B AWQ, agent workload
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.88,
    max_model_len=8192,
    max_num_seqs=16,
    enable_prefix_caching=True,  # agents share system prompts; hit rate > 60%
)
24 GB GPU, Llama-3-70B, sovereign deployment
Pure vLLM will not fit this case. Run a tiered runtime (llama.cpp with GPU-layer split, or Sector88 Runtime on top) and accept the latency trade that the tier crossing imposes. The stability win (no OOM, no tail-latency spikes from CUDA allocator thrash) is almost always worth it for on-prem deployments where you cannot just “add more GPUs”.
What not to do
- Do not set gpu_memory_utilization to 1.0. You are asking the OOM killer to do your capacity planning.
- Do not run vLLM and another CUDA workload on the same card with default settings. Either isolate the GPU or carefully budget for both.
- Do not keep max_model_len at the model’s maximum “because we might need it later”. You will pay for that context every single inference, today, in KV cache memory.
- Do not chase performance without stability first. An engine that serves at 10 tokens/s for five minutes and then crashes is an engine that serves at 0 tokens/s. Fix OOM first, then tune throughput.
- Do not blame vLLM. vLLM is a great engine for the job it is designed for. It is a single-tier allocator optimised for datacentre cards. When you push it past that envelope, reach for a tier-aware layer instead of fighting the tool.
Summary
The path from “vLLM keeps crashing” to “vLLM serves reliably” is usually three small configuration changes. Cap gpu_memory_utilization. Cap max_model_len to your real traffic. Cap max_num_seqs to your real concurrency. If those three fixes do not cover you, go further down the checklist: quantise, profile, and ultimately tier.
The rule of thumb: if the model fits in VRAM at the precision you want, stay in pure vLLM and tune it. If the model does not fit, stop trying to force it. Put a memory-tiering layer between your application and the engine, and let the tiers do what they were designed to do.