NVIDIA's Fan Controller Was Ignoring the Temperature
NVIDIA's RTX PRO 6000 Blackwell crashed daily under ML workloads. 95,000 telemetry samples revealed the fan controller was running at 53% while the GPU hit 92C — one degree from thermal shutdown.
The Sound
Fans screaming at full blast. Then silence. Then a black screen.
That's how it started. Every time. The RTX PRO 6000 Blackwell — NVIDIA's flagship professional GPU, with 96GB of VRAM — would ramp its fans to 100%, go completely dark, and take the entire system with it. (VRAM is video RAM: memory that lives on the graphics card itself. Unlike regular RAM, which everything shares, VRAM is dedicated to the GPU. 96GB is absurd; most consumer cards have 8-16GB, and unless you're on Apple Silicon, where unified memory blurs the line, it's a separate pool entirely. Enough to load entire AI models into GPU memory at once.) The only fix was walking over, holding the power button for four seconds, unplugging the PSU (the power supply unit, the box that converts wall power into the voltages components need), waiting, and plugging it back in. "Unplug the PSU" here means physically pulling the power cable from the wall: the IT equivalent of "have you tried turning it off and on again," except you have to actually get up from your chair.
This happened multiple times a day. During ML inference runs. During actual work. (Inference is the machine learning term for running a trained AI model to get predictions. Training teaches the model; inference uses it — studying for the exam versus taking it. Inference hammers the GPU with sustained math for hours, which is exactly the workload that triggers this bug.)
Five days chasing this. Four different fixes. None worked. What turned up instead was a firmware bug so obvious it's hard to believe it shipped — and 95,367 telemetry samples that proved it. (Firmware is software burned directly into hardware. You can't just uninstall it; it ships with the chip. When firmware has a bug, you're stuck until the manufacturer releases an update. In this case, NVIDIA's firmware controls the fans. And the firmware is wrong.)
What's a GPU and Why Does It Need Fans?
If you've never thought about GPU cooling before, here's the short version.
A GPU is a processor. It does math — lots of it, very fast. All that math generates heat. A high-end GPU draws 600 watts at peak. For context: a typical gaming desktop uses 300-500W total. This single card draws 600W by itself, more than the rest of the computer combined and roughly what a small space heater puts out. Basically a space heater that happens to do math. That's why it needs its own cooling system.
To keep it alive, there's a cooling system. Metal heatsinks — chunks of copper or aluminum with fins that work like a car radiator, absorbing heat from the chip, spreading it across a large surface area, and letting air carry it away — pull that heat off the GPU. (Without a heatsink, a GPU would melt through its circuit board in seconds. Not an exaggeration.) Fans blow air across those heatsinks. And firmware — a small program baked into the GPU — watches the temperature and tells the fans how fast to spin.
Hotter chip, faster fans. Simple feedback loop. When it works.
The firmware is the brain of this operation. It reads the temperature sensor, decides how fast the fans should run, and sends that command. If the firmware gets it wrong, the fans don't respond to temperature. And if the fans don't respond, the GPU cooks itself.
That's exactly what happened here.
The Obvious Suspects
First stop: NVIDIA's GitHub. Issue #1045 — dozens of people with the same crash. RTX 5080, 5090, PRO 6000. Same symptoms. NVIDIA had acknowledged a firmware bug internally but hadn't shipped a fix.
The community consensus was "GSP firmware bug." The GSP is the GPU System Processor — a tiny computer inside the GPU that manages everything: power, temperature, fan speed, clock speeds. Think of it as the GPU's operating system. When the GSP crashes, the GPU doesn't just slow down; it completely disconnects, like the pilot ejecting from a plane. People said it was crashing for unknown reasons.
The obvious stuff:
- Driver downgrade: 590 to 580. Crashed on both.
- Persistence mode — keeps the GPU driver loaded between tasks instead of reinitializing each time, which normally helps with stability: Enabled it. Still crashed. The bug is deeper than the driver.
- Power limit reduction: 600W down to 500W, then 450W. Crashed at 500W. Lasted longer at 450W but still died.
- PCIe power management: Disabled ASPM (Active State Power Management), the power-saving feature that lets idle PCIe lanes "nap" — PCIe being the high-speed connection, the data highway, between the GPU and the rest of the computer. Like automatic headlights, except sometimes the headlights turn off while you're still driving. That was the theory: the GPU was losing its connection during a power state transition. It wasn't. Didn't help.
Five crashes across four days. Nothing worked. "GSP firmware bug" explained what was crashing but not why. Something was triggering it. Time for data.
Building the Monitoring
Round 1: The Quick and Dirty Way
A bash script. Poll nvidia-smi — NVIDIA's command-line dashboard for GPUs, a one-stop readout of temperature, power draw, fan speed, and memory usage — every 2 seconds, dump to CSV. Temperature, power, fan speed, clock speeds, throttle reasons. Basic. The catch: nvidia-smi spawns a whole new process every time you call it and can only sample about once per second. Fine for checking on things. Not fine for catching a crash that happens in 300 milliseconds.
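A minimal sketch of that polling loop, written here in Python rather than bash for consistency with the later scripts. It assumes the standard nvidia-smi --query-gpu field names; the output filename is arbitrary:

```python
#!/usr/bin/env python3
"""Round-1-style poller: sample nvidia-smi every 2 s and append rows to a CSV."""
import subprocess
import time

FIELDS = ("timestamp,temperature.gpu,power.draw,fan.speed,"
          "clocks.sm,clocks_throttle_reasons.active")

with open("gpu_telemetry.csv", "a") as log:
    while True:
        # One process spawn per sample -- exactly why this approach tops out around 1 Hz.
        row = subprocess.run(
            ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        log.write(row + "\n")
        log.flush()
        time.sleep(2)
```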
Within an hour, something weird showed up. The fans were spiking to 100% during steady workloads, then dropping back. No crash, no error — just a sudden noise blast that lasted 7 seconds. And it was happening at 82C. That's not even close to the thermal limit.
Round 2: The Right Way
nvidia-smi is a blunt tool. It spawns a new process every call, only shows one aggregate fan speed, and can't expose the throttle reason bitmask — a number where each individual bit represents a different condition. Imagine a row of light switches, each one meaning something different: overheating, power limit hit, clock throttled. nvidia-smi just tells you "something's wrong"; the bitmask tells you exactly which switches are flipped, even if they only stay on for a tenth of a second. A rewrite in Python using NVML directly changed everything. NVML is the NVIDIA Management Library — a C library with Python bindings that talks directly to the GPU driver. It doesn't spawn a new process each call, can poll 10 times per second, reads each fan individually, decodes throttle bitmasks, and subscribes to GPU events asynchronously. The difference between checking your phone for notifications and having them pushed to your watch.
The difference was immediate. NVML showed me things nvidia-smi couldn't:
- Per-fan speed AND target: Both fans at 52%, but the firmware target was 30%. The firmware was overriding its own setting — but only to 52%, not the 80-90% the temperature demanded.
- Throttle reason bitmask: Transient flags that appeared for a single 100ms sample. Invisible to nvidia-smi.
- GPU events via interrupt API: No polling needed. Asynchronous notification of state changes. (A minimal read loop using these calls is sketched below.)
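A sketch of that kind of read loop using the pynvml bindings (nvidia-ml-py). Assumptions: a single GPU at index 0, and a driver recent enough to expose nvmlDeviceGetTargetFanSpeed. Illustrative only, not the exact monitoring script:

```python
#!/usr/bin/env python3
"""Round-2-style sampler: NVML directly, in-process, at 10 Hz."""
import time
import pynvml as nvml

nvml.nvmlInit()
gpu = nvml.nvmlDeviceGetHandleByIndex(0)
num_fans = nvml.nvmlDeviceGetNumFans(gpu)

try:
    while True:
        temp = nvml.nvmlDeviceGetTemperature(gpu, nvml.NVML_TEMPERATURE_GPU)
        power_w = nvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0          # reported in mW
        throttle = nvml.nvmlDeviceGetCurrentClocksThrottleReasons(gpu)  # 64-bit bitmask
        fans = [(nvml.nvmlDeviceGetFanSpeed_v2(gpu, i),                # actual %
                 nvml.nvmlDeviceGetTargetFanSpeed(gpu, i))             # firmware target %
                for i in range(num_fans)]
        sw_thermal = bool(throttle & nvml.nvmlClocksThrottleReasonSwThermalSlowdown)
        print(f"{temp}C {power_w:.0f}W fans={fans} "
              f"throttle={throttle:#x} sw_thermal={sw_thermal}")
        time.sleep(0.1)   # 100 ms -- catches flags nvidia-smi never shows
finally:
    nvml.nvmlShutdown()
```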
The monitoring script got anomaly detectors. Flag when the GPU is over 85C but fans are under 55%. Flag when power exceeds the limit by 50W or more. And automatic crash snapshots — when the GPU dies, dump the last 50 readings, the full dmesg (the Linux kernel's log, the black box flight recorder where every driver and piece of hardware reports what happened — the first place you look after anything goes wrong), PCIe state, and error counters.
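Roughly what those rules look like in code. This is a sketch with hypothetical field names, not the published script; the thresholds (85C, 55%, +50W, last 50 readings) are the ones described above, and reading dmesg may require root:

```python
from collections import deque
import json
import subprocess
import time

RECENT = deque(maxlen=50)   # the main loop appends each sample: RECENT.append(sample)

def check_sample(sample):
    """Anomaly rules: hot GPU with lazy fans, and power blowing past the cap."""
    alerts = []
    if sample["temp_c"] > 85 and max(sample["fan_pct"]) < 55:
        alerts.append("FAN_TEMP_INVERSION")
    if sample["power_w"] >= sample["power_limit_w"] + 50:
        alerts.append("POWER_LIMIT_VIOLATION")
    return alerts

def crash_snapshot(path="crash_snapshot.json"):
    """When the GPU falls off the bus, dump the last 50 readings plus the dmesg tail."""
    dmesg = subprocess.run(["dmesg", "--ctime"], capture_output=True, text=True).stdout
    with open(path, "w") as f:
        json.dump({"captured_at": time.time(),
                   "last_readings": list(RECENT),
                   "dmesg_tail": dmesg[-20000:]}, f, indent=2)
```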
The 8-Hour Run
The telemetry ran through a full day of ML workload at a 450W power limit. 95,367 samples. Eight hours. The numbers were bad.
The firmware was supposed to cap power at 450W. It let 632W through. That's 40% over the limit. Nearly half of all samples — 42,703 out of 95,367 — exceeded the power cap the firmware was supposed to enforce.
And the fans. 10,338 times the GPU was hot but the fans weren't responding. The firmware's target was 30%. It ran at 53%. At 89C — four degrees from the thermal maximum — the fans hadn't moved.
The fan controller wasn't just bad. It was ignoring temperature entirely.
10,338 fan-temperature inversions in 8 hours. The firmware was watching the GPU cook itself and doing nothing.
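Numbers like these fall out of a simple pass over the telemetry CSV, along the same lines as the anomaly rules above. The column names here are illustrative, not the logger's exact schema:

```python
import pandas as pd

# Telemetry from the 8-hour run; columns assumed: temp_c, fan_pct, power_w, power_limit_w.
df = pd.read_csv("gpu_telemetry_8h.csv")

over_cap = df[df["power_w"] > df["power_limit_w"]]
inversions = df[(df["temp_c"] > 85) & (df["fan_pct"] < 55)]

print(f"samples:             {len(df)}")
print(f"over power cap:      {len(over_cap)} ({len(over_cap) / len(df):.0%})")
print(f"worst power draw:    {df['power_w'].max():.0f} W")
print(f"fan/temp inversions: {len(inversions)}")
```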
The Crash, Captured
Power limit removed. Back to 600W default. Telemetry armed and ready. Seventeen minutes later, the GPU crashed.
But this time the monitoring captured everything. Every 100 milliseconds leading up to the moment it died.
The last five seconds of telemetry tell the story. The GPU hit 92C — one degree from its absolute maximum — with fans still at 53%. The firmware didn't react until 92C, which is absurdly late. When the fan controller finally started ramping, it had less than 2 seconds before the GPU System Processor hit its thermal fault threshold.
Two seconds isn't enough. Fans are physical things. They have mass. They can't spin from 53% to 100% instantly. By the time they got to 90%, the GSP had already pulled the plug.
The GPU fell off the PCIe bus — it physically disconnected from the computer's data highway. Register reads returned 0xffffffff, which in hardware-speak means "nobody's home." The kernel — the core of the operating system, the layer that talks directly to hardware — couldn't talk to it anymore. Not a program crash. Not a driver crash. Linux itself lost contact with the card. The only way back was a full power cycle — unplug the PSU, wait, plug it back in.
What Was Actually Wrong
The data made it obvious. This wasn't some mysterious GSP firmware glitch. It was a fan controller that didn't respond to temperature.
What actually happened is a flat line. 53% fan speed from idle to 92C. One panic spike at 82C that lasts 7 seconds, then right back to 53%.
Compare that to what should happen. A smooth ramp — hotter GPU, faster fans. Every GPU ever made does this. Except this one.
The GSP wasn't crashing randomly. It was crashing because the fan controller let the GPU overheat until the thermal fault threshold triggered a protective shutdown. The firmware had three separate failures:
- Fan controller: Targeted 30%, ran at 53%, ignored temperatures up to 92C
- Power limit enforcement: Let 632W through a 450W cap
- Emergency throttle timing: SwThermalSlowdown — the firmware's "oh no" brake, a flag that forces the GPU to drop its clock speed when dangerously hot, like putting a car in a lower gear to reduce engine strain — only fired at 92C, one degree from the hard shutdown limit and too late to prevent the crash. By the time the brakes hit, the car was already off the cliff.
The Fix
Take fan control away from the firmware. Manage it from userspace — regular software running on top of the operating system, as opposed to firmware (baked into the hardware) or the kernel (the OS core). When firmware is broken and there's no update, sometimes the fix is a userspace program that does the firmware's job better.
NVML provides nvmlDeviceSetFanSpeed_v2() for exactly this. A fan control daemon — a background program, always on, like a thermostat: set the rules once and it keeps running — reads GPU temperature and applies a sane fan curve, a mapping from temperature to fan speed ("at this temp, spin the fans this fast"). 30% at idle, ramps linearly to 95% at 85C, hits 100% at 88C. Boring, predictable, keeps the GPU alive. It uses a 3C hysteresis band — a dead zone that prevents oscillation, so the fans speed up at 80C but don't slow back down until 77C instead of bouncing between 79 and 80 forever and sounding like a helicopter with hiccups — plus an emergency override that bypasses all hysteresis. And it fails safe: if NVML stops responding, all fans go to 100%.
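A sketch of what such a daemon can look like with the pynvml bindings — not the published gist. The 40C ramp start is an assumption (the article only specifies the 30%/95%/100% curve points), and manual control via nvmlDeviceSetFanSpeed_v2 needs root and takes the fans out of the firmware's automatic policy:

```python
#!/usr/bin/env python3
"""Userspace fan daemon sketch: replace the firmware's broken fan logic."""
import time
import pynvml as nvml

RAMP_START_C = 40     # assumed idle point where the ramp begins (not from the article)
HYSTERESIS_C = 3      # don't let fans slow down until the temp has genuinely dropped
EMERGENCY_C = 88      # at or above this, skip hysteresis entirely and pin 100%

def curve(temp_c):
    """30% at idle, linear to 95% at 85C, 100% at 88C and above."""
    if temp_c >= EMERGENCY_C:
        return 100
    if temp_c >= 85:
        return 95
    if temp_c <= RAMP_START_C:
        return 30
    return int(30 + (temp_c - RAMP_START_C) * (95 - 30) / (85 - RAMP_START_C))

nvml.nvmlInit()
gpu = nvml.nvmlDeviceGetHandleByIndex(0)
num_fans = nvml.nvmlDeviceGetNumFans(gpu)
last_applied = 0

try:
    while True:
        try:
            temp = nvml.nvmlDeviceGetTemperature(gpu, nvml.NVML_TEMPERATURE_GPU)
            target = curve(temp)
            # Speed up immediately; only slow down once the temperature has fallen
            # past the hysteresis band. Emergency readings bypass the band.
            if (target >= last_applied or temp >= EMERGENCY_C
                    or curve(temp + HYSTERESIS_C) < last_applied):
                for i in range(num_fans):
                    nvml.nvmlDeviceSetFanSpeed_v2(gpu, i, target)
                last_applied = target
        except nvml.NVMLError:
            # Fail safe: if NVML stops responding, pin everything to 100%.
            for i in range(num_fans):
                try:
                    nvml.nvmlDeviceSetFanSpeed_v2(gpu, i, 100)
                except nvml.NVMLError:
                    pass
        time.sleep(1)
finally:
    # Hand control back to the firmware's automatic policy on exit
    # (requires a driver that exposes this call).
    for i in range(num_fans):
        nvml.nvmlDeviceSetDefaultFanSpeed_v2(gpu, i)
    nvml.nvmlShutdown()
```

The fail-safe is the important design choice: if the daemon or NVML ever stops responding, the fans pin to 100% rather than drifting back to the broken firmware behavior.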
Twenty degrees cooler. Zero crashes. The card runs like nothing was ever wrong.
The firmware's fan controller was the entire problem. Once bypassed, every symptom disappeared. No thermal throttling. No power limit violations. No fan-temperature inversions. No PCIe drops.
A flagship GPU was crashing because its fan controller was stuck at 53%.
Both scripts are MIT licensed and available as GitHub gists: telemetry monitoring and fan control daemon. If you're hitting the same crashes, please add your telemetry data to issue #1045. More data helps NVIDIA prioritize the fix.
Frequently Asked Questions
Why does the NVIDIA RTX PRO 6000 Blackwell crash under sustained ML workloads?
The GPU's fan controller firmware has a bug where it targets 30% fan speed and runs at ~53% regardless of GPU temperature, even at 92C (1 degree from thermal max). This causes thermal runaway that crashes the GPU System Processor, disconnecting the GPU from the PCIe bus entirely.
How do you diagnose GPU fan controller issues on Linux?
Poll NVML directly at 100ms intervals instead of using nvidia-smi. Track per-fan speed vs target, decode the throttle reason bitmask, and flag fan-temperature inversions where the GPU is hot but fans aren't responding. The telemetry data is far more useful than post-crash nvidia-bug-report logs.
How do you fix RTX 50-series GPU crashes caused by the fan controller bug?
Take fan control away from the firmware using NVML's nvmlDeviceSetFanSpeed_v2 API. A userspace fan daemon that reads GPU temperature and applies a proper fan curve (30% idle to 100% at 88C) dropped temperatures from 92C to 73C and eliminated all crashes.