Thermal throttling in the desert.
What happens when midday heat causes GPU thermal limits to kick in — and the configuration changes that kept inference stable without touching the hardware.
The ground station was in a shipping container with a split-unit air conditioner. We installed Runtime on an RTX 4090, loaded a 70B model, and ran validation overnight. Everything looked fine. Latency was stable, thermals were in the low 70s, and the operator signed off.
The problem showed up at noon.
The next day, the operator called. Inference latency had doubled between 11:00 and 14:00. Tokens that were coming back in 120ms were now taking 280ms. We pulled the metrics. GPU clock had dropped from 2.5GHz to 1.4GHz. The card was thermal throttling.
The container AC was sized for the equipment load, not the equipment load plus 45C ambient. By midday, the intake air was 38C. The GPU hit 83C and the firmware stepped the clocks down hard.
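The pattern in the pulled metrics is easy to check mechanically: clocks well below the healthy baseline while temperature sits at the firmware limit. A minimal sketch of that check, assuming metrics logged as (timestamp, clock MHz, temp C) samples — the function name and thresholds are illustrative, not from the actual tooling:

```python
# Flag samples where the GPU is likely thermal throttling: clock far below
# the baseline observed overnight, while temperature is at or near the limit.
BASELINE_CLOCK_MHZ = 2500   # healthy boost clock seen during overnight validation
THERMAL_LIMIT_C = 83        # temperature at which the firmware stepped clocks down

def is_throttling(clock_mhz, temp_c, clock_floor=0.7):
    """True if clock dropped below 70% of baseline at high temperature."""
    return clock_mhz < BASELINE_CLOCK_MHZ * clock_floor and temp_c >= THERMAL_LIMIT_C - 3

samples = [
    ("03:00", 2500, 71),    # overnight: healthy
    ("12:30", 1400, 83),    # midday: clocks stepped down hard
]
for ts, clock_mhz, temp_c in samples:
    print(ts, "throttling" if is_throttling(clock_mhz, temp_c) else "ok")
```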
The wrong fix is what most people try first.
The instinct is to add cooling. Bigger AC, exhaust fans, water blocks. But this site was remote. Anything that required a hardware shipment meant weeks of delay and customs paperwork. We needed a software fix.
We started with power limits. An RTX 4090 has a 450W TDP. Dropping the power limit to 70% (315W) reduces heat generation significantly while only cutting clock speeds modestly. We tested it: latency went from 280ms back to 170ms. Not perfect, but usable. The GPU stabilized at 76C.
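The power-limit change itself is a single nvidia-smi call (run as root). Note the limit does not survive a reboot, so it belongs in a boot script:

```shell
# Cap board power at 315 W (70% of the RTX 4090's 450 W default)
sudo nvidia-smi -pm 1       # persistence mode so settings survive driver unloads
sudo nvidia-smi -pl 315     # set power limit in watts

# Confirm the new limit and watch clocks and temperature settle
nvidia-smi --query-gpu=power.limit,clocks.sm,temperature.gpu --format=csv
```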
Then we found the bigger win.
The model was running with batch size 1, which meant the GPU was finishing each request quickly and then sitting idle — but the idle state was still pulling 180W because the memory clock stayed high. Runtime has a power-gating setting that drops the memory clock between requests when batch size is low. We enabled it.
Idle power dropped to 90W. Average temperature during the midday window fell to 71C. Latency stayed at 150ms — better than the original overnight baseline, because the card was no longer cycling between hot and cold.
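The effect of gating on average board power can be sanity-checked with duty-cycle arithmetic. The 40% active fraction below is an assumed illustration, not a measured value from the site:

```python
def avg_power(active_w, idle_w, active_fraction):
    """Time-weighted average board power over a request/idle duty cycle."""
    return active_w * active_fraction + idle_w * (1 - active_fraction)

ACTIVE_W = 315          # capped power draw while serving a request
ACTIVE_FRACTION = 0.4   # assumed share of time spent on inference at batch size 1

before = avg_power(ACTIVE_W, 180, ACTIVE_FRACTION)  # memory clock held high at idle
after = avg_power(ACTIVE_W, 90, ACTIVE_FRACTION)    # power gating drops idle to 90 W
print(f"avg power: {before:.0f} W -> {after:.0f} W")
```

Halving idle power cuts a large, steady slice out of the thermal load, which is why the midday average temperature fell even though active power was unchanged.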
What we changed.
- Power limit set to 70% via nvidia-smi -pl 315
- Runtime power-gating enabled for low-batch inference
- Thermal alert threshold lowered to 78C so the operator gets warned before throttling starts
- Hub configured to queue requests during thermal events rather than timing out
The takeaway.
Thermal throttling is not a cooling problem — it is a power management problem. If you can reduce the power envelope without collapsing performance, you buy headroom. And if your runtime knows when to gate power between requests, you buy even more.
We did not ship new hardware. We changed four lines of configuration. The site has been stable for eight months.