Tags: docker · overlayfs · ssd · performance · linux · debugging

Fifty Cents a Day Doesn't Sound Like Much

Until you do the math.

My solid-state drive cost $600. An SSD is the storage in modern computers: unlike old hard drives with spinning metal platters, SSDs use memory chips. They're fast, silent, and have one fatal flaw: every single write wears the chips out slightly. Think of it like a pencil eraser — works great for thousands of uses, but eventually there's nothing left. Manufacturers rate them with a "total bytes written" guarantee; once you exceed it, you're living on borrowed time. Mine is rated for 1,000 terabytes of total writes — one petabyte. A terabyte is roughly 1,000 copies of the entire English Wikipedia, so a petabyte is a million Wikipedias this drive can absorb before it wears out. That sounds like a lot, and for normal use, it is. You'd have to try pretty hard to wear it out — unless, say, you're accidentally writing 840 Wikipedias' worth every single day.

I was trying very hard. I just didn't know it.

Every day, my system was writing over a terabyte of unnecessary data to this drive — from half a dozen different sources I hadn't thought to measure. At the worst point, the drive was hitting 4,380 megabytes per second of writes. That's nearly its physical maximum. The drive was screaming.

At that rate, I was grinding through $180 worth of drive life every year, burning 30% of the SSD's total lifespan annually, turning a drive that should last a decade into a three-year countdown.
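The arithmetic behind those numbers is simple enough to sketch. This is just the back-of-envelope math, assuming the drive's 1,000 TB endurance rating and the 840 GB/day write volume measured later in this post (it lands slightly above the rounded $180 figure):

```python
# wear_math.py — back-of-envelope SSD wear cost, from the figures in this post.
drive_cost = 600        # dollars
endurance_tb = 1000     # rated total bytes written, in TB (1 petabyte)
daily_gb = 840          # measured unnecessary writes per day

cost_per_tb = drive_cost / endurance_tb                 # $0.60 per TB written
daily_cost = daily_gb / 1000 * cost_per_tb              # ~ $0.50 per day
annual_cost = daily_cost * 365                          # ~ $184 per year
annual_lifespan_pct = daily_gb / 1000 * 365 / endurance_tb * 100  # ~ 30% per year

print(round(daily_cost, 2), round(annual_cost), round(annual_lifespan_pct, 1))
```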

Here's what that looks like:

[Interactive: "$600 SSD — Lifespan Drain." A $600 SSD rated for 1 petabyte, with live counters for days elapsed, GB written to the SSD, and drive life burned in dollars. Watch what 840 GB of daily writes — $0.50 a day in wear — does to it over a year.]

This is the story of how I found six different ways my system was thrashing an SSD — and how the worst offender turned out to be a script I wrote to protect it.

What I Was Building

I run an AI coding benchmark. Think of it as a standardized test for AI coding assistants: you give each agent the same set of real bugs from production open-source projects, let them try to fix each one, then grade the results. Did the fix work? Did the tests pass? It's how you know whether an AI tool actually helps or just sounds confident. (The answer varies more than you'd expect.) Mine covers 500 bug-fix tasks and 200 feature-building tasks, across different tools, different models, different configurations.

Each task gets its own sealed sandbox — a Docker container. Docker creates isolated environments called "containers": imagine a sealed room with everything a program needs — its own operating system, its own files, its own tools. The program inside can't see or touch anything outside the room, and when it finishes, you demolish the room. Spin one up, let the agent work for 30 minutes, capture results, tear it down. About 600 of these per campaign. A full campaign takes 20 hours. Campaigns run daily.

600 containers — spun up and torn down every campaign
20 hrs per campaign — running daily, 7 days a week
$600 SSD — rated for 1 petabyte of total writes

The machine is built for this: a GPU running AI models locally — an NVIDIA RTX PRO 6000 with 96 GB of dedicated memory and a 600-watt appetite, the same card that had its own drama with a firmware bug. A Graphics Processing Unit was originally designed to render video game graphics; it does thousands of math operations simultaneously, which happens to be exactly what AI models need, so it's now the engine behind AI. It draws more power than a space heater, and it means I can run models locally instead of paying cloud providers per query. Add dozens of processes managing container lifecycles and constant disk activity. Heavy workload.

So when I first noticed the SSD taking a beating, the reaction was: "Yeah, that tracks."

It didn't track. Not even close.

The Monitoring Script That Changed Everything

I'd built a small monitoring tool — 53 lines of code that read the operating system's disk statistics twice per second, calculated write speed and drive utilization, and logged the numbers. Linux tracks every read and write operation in a file called /proc/diskstats. It's like a mileage counter for your hard drive, updated in real time by the kernel: read it twice a second, subtract the old numbers from the new ones, and you get live write speed. Free, built into every Linux system, and more useful than any expensive monitoring tool I've ever used. Under the hood, my script was just reading /proc/diskstats and doing subtraction. Took about as long to write as making a sandwich.

That sandwich-effort script revealed a disaster I'd been blind to for weeks.
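For flavor, here's a minimal sketch of that kind of monitor. This is not my original 53-line script; the device name nvme0n1 and the helper names are my own choices, but the mechanics (field 10 of /proc/diskstats counts 512-byte sectors written) are standard:

```python
# diskstats_sketch.py — minimal write-rate monitor in the spirit of the
# 53-line script described above. Device name "nvme0n1" is an assumption.
import time

SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of hardware


def sectors_written(diskstats_text: str, device: str) -> int:
    """Cumulative sectors written for one device, parsed from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) > 9 and fields[2] == device:
            return int(fields[9])  # field 10: sectors written since boot
    raise ValueError(f"device {device!r} not found")


def write_rate_mb_s(device: str = "nvme0n1", interval: float = 0.5) -> float:
    """Sample twice, subtract, convert the delta to MB/s."""
    with open("/proc/diskstats") as f:
        before = sectors_written(f.read(), device)
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        after = sectors_written(f.read(), device)
    return (after - before) * SECTOR_BYTES / interval / 1e6

# On a Linux host with an NVMe drive: print(write_rate_mb_s())  # live MB/s
```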

I kicked off a batch of 16 containers and watched the monitor:

Caught in the Act — 16 Containers Starting Up (real data)

My 53-line monitoring script captured this. Each sample shows write load as a percentage of the drive's physical maximum of 5,000 MB/s:

12:37:37 — 81% · 12:37:38 — 87% · 12:37:39 — 85% · 12:37:40 — 88% · 12:37:42 — 90% · 12:37:45 — 91% · 12:37:50 — 93% · 12:38:00 — 94% · 12:38:05 — 94%

For 40 seconds, the SSD was pinned at 88–94% of its physical limit — not from AI agents doing real work, but from my "protection" script setting up RAM overlays. The AI agents hadn't even started yet.

The SSD was pegged. 4,380 megabytes per second of writes. 94% drive utilization. And the AI agents hadn't even started yet — the writes were happening during container setup, before a single line of AI code ran.

The monitoring script told the real story in one hour. Weeks of code review had missed it entirely.

Six Ways to Thrash an SSD

Once I started measuring, I found the writes were coming from everywhere. Not one big leak — six of them, stacked on top of each other.

The Damage Report — 6 Sources Found

1. Docker container logs (0.6–6 GB / batch) — Docker's default log driver writes every line of container output to a JSON file on the SSD. No rotation, no cap.
2. Host temp files (variable) — Python's tempfile calls landing on the SSD by default, while 63 GB of empty RAM at /dev/shm sat right there, unused.
3. SQLite fsyncs (600 / batch) — 600 sync-to-disk operations per batch for tracking results. Each one forces the SSD to flush its write cache.
4. pip package installs (96 GB / day) — 160 MB per container × 600 containers. Every container installed Python packages, and every install wrote to the SSD.
5. Container lifecycle (~2.4 GB / create+destroy) — Docker's own overhead for spinning containers up and tearing them down. 600 cycles a day adds up fast.
6. My "protection" script (840+ GB / day) — a script I wrote to protect the SSD was actually the single largest source of writes on the entire machine.
That last one dwarfs everything else combined. The script I wrote to protect the SSD was responsible for more writes than all other sources put together.

The Easy Wins: Silencing the Noise

Before tackling the monster, I went after the low-hanging fruit. Four changes, each one simple, each one measurable:

Docker logs → silenced. One flag — --log-driver=none — tells Docker to stop writing container output to disk. I was already capturing output through other means, so the log files were pure waste. Savings: up to 6 GB per batch.

docker run --log-driver=none ...

Host temp files → redirected to RAM. Python was creating temporary files on the SSD by default. /dev/shm is a directory that lives entirely in RAM: files written there never touch the SSD — they're stored in memory and vanish when the machine restarts. Most Linux systems have it, most developers never think about it, and most programs write temporary files to the SSD instead. On my system, 63 GB of /dev/shm was sitting empty while the SSD was getting hammered. Pointing the default temp directory there (/dev/shm/trw-eval) was a one-line configuration change.
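In Python, that redirect can be as small as this. A sketch, not my exact change: the /dev/shm/trw-eval path is the one from my setup, setting tempfile.tempdir is one of several equivalent ways (exporting TMPDIR is another), and the fallback line is only there so the snippet runs on machines without /dev/shm:

```python
# tmp_to_ram.py — redirect Python's default temp directory to RAM (tmpfs).
import os
import tempfile

# /dev/shm is a RAM-backed tmpfs on most Linux systems; fall back elsewhere.
base = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
ram_tmp = os.path.join(base, "trw-eval")
os.makedirs(ram_tmp, exist_ok=True)

tempfile.tempdir = ram_tmp  # every later tempfile.* call now lands here

with tempfile.NamedTemporaryFile() as f:
    in_ram = f.name.startswith(ram_tmp)  # temp files no longer hit the SSD
```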

pip installs → RAM-backed. Every container installed Python packages to the SSD. Docker's --tmpfs flag redirects those writes to RAM. 160 MB per container, 600 containers per day — that's 96 GB of SSD writes eliminated with one flag.

docker run --tmpfs /tmp/trw-pip:exec,size=256m ...

SQLite → batched writes. Write-Ahead Logging (WAL) is a database mode where changes get written to a separate log file first, then batched together before being flushed to the main database file. Instead of 600 individual "write this to disk RIGHT NOW" operations, the database collects them and writes once. Fewer fsyncs = fewer SSD writes = longer drive life. Switching SQLite to WAL with relaxed sync settings cut 600 individual disk flushes down to batched writes.
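From Python, the switch is two PRAGMA statements. A sketch under my own assumptions (the table name and the 600-row batch are illustrative, not my production schema):

```python
# sqlite_wal.py — WAL mode plus relaxed sync, so 600 inserts become one
# batched write instead of 600 individual flushes.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "results.db")
db = sqlite3.connect(path)

mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]  # returns 'wal'
db.execute("PRAGMA synchronous=NORMAL")  # fsync at WAL checkpoints, not every commit

db.execute("CREATE TABLE results (task TEXT, passed INTEGER)")
with db:  # one transaction: the 600 rows land together, not one flush each
    db.executemany(
        "INSERT INTO results VALUES (?, ?)",
        [(f"task-{i}", i % 2) for i in range(600)],
    )
count = db.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```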

These were the appetizers. The main course was still destroying the drive.

Docker Already Has a Safety Net

Here's the thing I should have understood before building anything on top of it: Docker was already handling write protection.

Docker stores your application as a stack of frozen, read-only layers — like transparencies stacked on an overhead projector. When code inside a container needs to change a file, Docker copies just that one file to a thin writable layer on top. Everything else stays frozen. This is called copy-on-write (explained well in the Docker storage driver docs), and it's beautifully efficient. Copy-on-write means "don't bother copying anything until someone actually changes it." Imagine a library where every visitor reads the same physical book — but the moment someone wants to write in the margins, the library photocopies just that page for them. Everyone else still reads the original. That's Docker: 600 containers can share the same base files without making 600 copies. Brilliant system. I decided to outsmart it. Spoiler: I did not outsmart it.

[Diagram, 3 slides: Docker stores your application like a stack of transparencies on an overhead projector. Base operating system (Ubuntu 22.04) on the bottom, then Python + packages (pip, conda, dependencies), then your code (/testbed — the code each agent works on) on top. These layers are frozen — nothing can change them, and all 600 containers share the same frozen copy. Docker's writable layer sits on top: only changed files land on the SSD — 50 KB per agent, not gigabytes. My "improvement" added a RAM layer that caught writes perfectly — but its setup rename copied EVERYTHING to the SSD first.]

An AI agent that edits 10 files generates about 50 kilobytes of SSD writes. The frozen layers never get touched. Docker handles it.

But I didn't trust it. "What if there are writes Docker doesn't catch? What if the overhead is worse than I think?" I had concerns. I had theories. What I didn't have was data.

So I built a workaround.

My "Shield" Was Swinging the Sword

The idea sounded bulletproof: intercept every single write inside the container and redirect it to RAM instead of the SSD. RAM (Random Access Memory) is your computer's short-term memory: blazing fast, it doesn't wear out from writing, and it vanishes the moment you lose power. Think of it as a whiteboard — perfect for temporary work, terrible for permanent storage. I had 128 GB of it, and I figured: why let containers write to the SSD when I can catch everything in RAM? Writes to RAM don't touch the drive at all. No wear, no problem.

I wrote a setup script — about a page and a half of shell commands. It ran inside every single container before the AI agent started. Here's the core of it:

# Rename the code directory so we can mount a RAM layer on top
mv /testbed /testbed.ro

# Set up a RAM-based overlay to catch all writes
# (overlayfs also requires a scratch workdir on the same filesystem as upperdir)
mount -t overlay overlay \
  -o lowerdir=/testbed.ro,upperdir=/ramtmp/testbed-upper,workdir=/ramtmp/testbed-work \
  /testbed

Then 50+ more lines of setup: creating mount points, configuring timeouts, handling edge cases. Every container needed --privileged to run these filesystem operations. The --privileged flag is Docker's "I trust this container with everything" switch: normally, containers are locked down — they can't access the host machine's hardware, mount filesystems, or do anything dangerous — and --privileged rips all those guardrails off. It's necessary for some legitimate use cases, but security teams hate it, because a compromised privileged container can do anything to the host machine. My workaround required it. My fix didn't. There was even a verification document confirming everything was "properly configured."

It was thorough. It was documented. And it was the single biggest source of SSD writes on the entire machine.

The Trap: Copy-Up

Remember the first line of the setup script?

mv /testbed /testbed.ro

Looks harmless. Just renaming a folder. On a normal computer, renaming is instant — the operating system just updates a label.

But inside a Docker container, renaming triggers something called copy-up — Docker's hidden trap door. Remember those frozen read-only layers? When you rename a file that lives in a frozen layer, Docker can't just update a label, because the frozen layer is immutable. Instead, the kernel has to physically copy every single file from the frozen layer into the writable layer before it can rename anything. A "rename" that should take milliseconds can copy hundreds of megabytes to your SSD. The kicker: you'd never know unless you measured, because the command still feels instantaneous from the user's perspective.

[Diagram: The Copy-Up Trap. The theory: a RAM overlay sits above the frozen, read-only image layers — /testbed (340 MB) plus /opt (1–12 GB) — and all agent writes go up to RAM. Fast, no SSD wear, zero SSD writes, $0 in drive wear, the $600 SSD untouched. That was the theory.]

When I renamed /testbed, the kernel physically copied every file in that directory from the frozen layer to the writable layer. On the SSD. 340 megabytes per container.

And /testbed wasn't even the big one. The script also reorganized /opt — home of Python and all its packages. That copy-up ranged from 1 to 12 gigabytes. Per container.

Then the RAM overlay mounted on top, and it worked perfectly. Every subsequent write went to RAM. Exactly as designed. The irony is almost poetic: the protection system worked flawlessly — after it had already caused all the damage it was designed to prevent.

600 containers × 1.4 GB of copy-up overhead = 840 GB. Every day. All of it unnecessary.

Feel the Scale

Drag the slider. Watch the numbers climb.

[Interactive calculator: drag from 600 up to 1,600 containers per day. At 1.4 GB of copy-up per container, 600 containers means 840 GB of total SSD writes daily — about $184/year in wear on a $600 drive. Before the fix, my "protection" script triggered a copy-up on every container startup: 340 MB of code plus gigabytes of Python packages, all copied to the SSD before the AI agent even started. From infrastructure, not from work.]

Four Attempts, Three Failures, One Deletion

Once I understood the copy-up trap, fixing it still wasn't a straight line. I tried four different approaches over three days. Each one taught me something — mostly about hubris.

The Fix Journey
1. The "brilliant" RAM overlay — MADE IT WORSE.
Plan: intercept every container write and redirect it to RAM. Shield the SSD completely. It sounded bulletproof.
Result: the shield was the sledgehammer. The setup renamed a directory, which copied everything to the SSD first. My protection was the #1 source of damage.

2. Just skip the big folders — BAND-AID.
Plan: protect small directories, skip the big ones. Cut the worst copy-ups.
Result: writes dropped a bit, but this was classic symptom-chasing. The code folder still copied 340 MB per container. Root cause: still there.

3. Move all of Docker to RAM — REVERTED.
Plan: the nuclear option — run the entire Docker storage in RAM. Zero SSD writes. Technically perfect. Operationally insane.
Result: every reboot re-downloaded every image, with two Docker systems running at once. Lasted 2 hours before I hit revert.

4. Delete everything I built — SHIPPED.
Plan: remove the entire protection system. Trust Docker to do what Docker was designed to do.
Result: a page and a half of code became 2 lines. No security permissions needed. SSD wear dropped from $180/year to $4.

The first attempt was the protection system itself — the one causing the problem. The second was a band-aid that reduced writes but left the root cause untouched. The third was the nuclear option: I spun up a second Docker daemon — the process that manages all containers — with its storage on /dev/shm, in RAM. Every container would write to RAM, not the SSD. Zero SSD writes. Technically perfect. But /dev/shm is temporary memory: everything disappears when the machine restarts, so every reboot meant re-downloading every container image, with two Docker systems running side by side. Impossible to operate. Reverted that one in two hours.

The fourth attempt was the hardest one to accept: delete everything I'd built.

The Fix Was Deletion

The final solution wasn't a better version of the protection system. It was accepting that Docker's copy-on-write was already doing the job, and my entire protection layer was unnecessary overhead with a devastating side effect.

Before — a page and a half of setup, elevated security permissions, mount points, timeout polling:

mkdir -p /ramtmp && mount -t tmpfs -o size=12g tmpfs /ramtmp &&
mv /testbed /testbed.ro &&
mkdir -p /testbed /ramtmp/testbed-upper /ramtmp/testbed-work &&
mount -t overlay overlay -o lowerdir=/testbed.ro,... /testbed;
# ... 50 more lines of filesystem acrobatics ...

After — 2 lines:

touch /tmp/.overlayfs-ready && exec sleep {timeout}

That's it. Let Docker do what Docker does. An agent that edits 10 files generates 50 KB of writes. Not 1.4 GB of overhead to "prevent writes."

For the few things that genuinely write a lot, Docker's own --tmpfs flags handle it:

docker run \
  --log-driver=none \
  --tmpfs /tmp/trw-pip:exec,size=256m \
  --tmpfs /root:size=256m \
  ...

The --tmpfs flag tells Docker: "mount a chunk of RAM at this path." Any files written there go to memory, not the SSD. It's Docker's built-in version of exactly what my protection script was trying to do — except it works, it's one flag, and it doesn't require elevated permissions. Three of these flags replaced my entire protection system: a page and a half of code and a security hole.

One More Trick: Stop Destroying What You Can Reuse

The last optimization was the simplest conceptually: stop creating and destroying 600 containers when you can reuse them.

Docker generates about 2.4 GB of SSD writes every time it creates and tears down a container. Multiply by 600, and that's another 1,440 GB of daily writes. A container pool keeps containers alive between uses instead of destroying and recreating them: when one task finishes, the pool resets the container to a clean state (a cheap git checkout, plus deleting temp files) and hands it to the next task. The reset costs almost nothing in SSD writes; the full create/destroy cycle costs 2.4 GB. The pool cut container lifecycle writes roughly in half.
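The pool logic itself is tiny. Here's a toy Python sketch of the idea — not my production code: the class, the queue-based bookkeeping, and the docker exec git checkout reset command are all my illustrative assumptions about what "reset instead of recreate" looks like:

```python
# container_pool.py — toy sketch: reuse live containers, reset between tasks.
import queue
import subprocess


class ContainerPool:
    """Hand out live containers; reset on release instead of destroy+recreate."""

    def __init__(self, container_ids):
        self._idle = queue.Queue()
        for cid in container_ids:
            self._idle.put(cid)

    def acquire(self):
        return self._idle.get()  # blocks until a container is free

    def release(self, cid):
        self.reset(cid)          # ~0 GB of SSD writes vs ~2.4 GB per create/destroy
        self._idle.put(cid)

    def reset(self, cid):
        # Illustrative reset: discard the agent's edits inside the container.
        subprocess.run(
            ["docker", "exec", cid, "git", "-C", "/testbed", "checkout", "--", "."],
            check=True,
        )
```

The reset step is pluggable; anything that restores a known-clean state without rewriting gigabytes (git checkout, deleting a tmpfs, truncating scratch dirs) fits the same shape.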

The Results

Before vs After

Peak SSD write rate: 4,380 MB/s before the fix
Per-container writes: 1.4 GB → ~50 KB
pip install writes: 160 MB/container → 0 (RAM-backed via --tmpfs)
Docker log writes: 0.6–6 GB/batch → 0 (--log-driver=none)
Annual SSD wear cost: ~$180/year → ~$4/year
Drive lifespan at this rate: 3.3 years → outlasts the machine
Security permissions: --privileged → none
Lines of setup code: 62 → 2
The SSD will now outlast the computer, my career, and — at this rate — probably my relevance in the job market. Although honestly, the AI agents I'm benchmarking are coming for that anyway.

The Twist I Didn't Expect

Here's the part I didn't see coming: fixing the SSD wasn't just about hardware preservation. It was about data quality.

The SSD thrashing had been causing OOM kills — Out of Memory kills — in 14.3% of evaluation runs. An OOM kill is when the Linux kernel decides a process is using too much memory and sends it SIGKILL: no warning, no cleanup, no graceful shutdown (the killed process exits with status 137, which is 128 plus SIGKILL's signal number, 9). In my system, containers were being terminated by the kernel not because they actually ran out of memory, but because the I/O pressure from SSD thrashing was creating cascading resource exhaustion.

Those kills looked like the AI agent had failed to solve the problem. They hadn't. The infrastructure had killed them before they had a chance.

After fixing the SSD thrashing, baseline evaluation reliability jumped 34 percentage points. Not from improving any AI model. Not from better prompts. From fixing infrastructure. That 34-point improvement was the single largest accuracy gain in the entire project — larger than any model tuning, any prompt optimization, any architectural change.

I spent weeks tuning AI model configurations for a few percentage points of improvement. Then I fixed the SSD and got 34 points for free.

The lesson hit hard: infrastructure quality is research quality. You can't measure what your system is doing if the system itself is interfering with the measurement. I was trying to evaluate AI coding agents while the infrastructure was silently killing them and blaming them for the failure.

What I Took Away

Measure before you build. I had a verification document. I had code comments. I had a design review. None of it mattered because nobody measured the actual writes hitting the drive. A 53-line monitoring script, written in the time it takes to make a sandwich, found six sources of waste that weeks of review missed. Instruments don't lie. Documents can.

Don't outsmart your tools. Docker's copy-on-write exists for exactly this scenario. It already minimizes writes. Building a protection layer on top of Docker's built-in protection was like bringing a backup parachute — except this backup parachute was made of bricks. Understand what your tools already do before adding to them.

Death by a thousand writes. The SSD wasn't being killed by one big leak. It was six different sources, stacked on top of each other. Logs, temp files, pip installs, container lifecycles, the overlayfs copy-up trap — each one seemed minor in isolation. Together, they were writing over a terabyte per day. The lesson: when you're running at scale, "minor" writes multiply into major problems.

Infrastructure failures masquerade as application failures. 14.3% of my evaluation runs were "failing" because the infrastructure was killing them, not because the AI couldn't solve the problem. Without measuring at the infrastructure level, I was blaming the wrong layer. How many "flaky tests" in production are actually infrastructure problems wearing an application mask?

The best optimization is often deletion. The fix wasn't better code. It was recognizing that the code shouldn't exist at all, swallowing my pride, and deleting it. The hardest part wasn't the engineering. It was admitting the engineering was the problem.

Sometimes the most sophisticated decision an engineer can make is to stop engineering.

This post is part of my engineering blog where I write about the real debugging stories behind production systems. If this kind of infrastructure deep-dive interests you, check out how I diagnosed a GPU firmware bug on the same machine.

Frequently Asked Questions

What is Docker copy-up and why does it cause SSD writes?

Docker stores your application as frozen, read-only layers. When code inside a container needs to change a file, Docker copies just that file to a writable layer — efficient and lightweight. But if you rename or move an entire folder, Docker has to copy every file in that folder to the writable layer all at once, because the frozen layer can't be modified in place. A rename that feels instant actually triggers a massive copy operation to the SSD. In my case, renaming a single directory copied 340 MB per container — and I was running 600 containers a day.

How do you reduce Docker container SSD writes?

Trust Docker's built-in copy-on-write — it only writes files that actually change. Beyond that: use --log-driver=none if you capture output separately. Use --tmpfs flags for heavy-write paths like package installs. Redirect your host's temp directory to /dev/shm (RAM). Reuse containers with a pool instead of destroying and recreating them. And never run mv on large directories inside containers — it triggers a full copy-up that can write gigabytes to the SSD from a command that looks instantaneous.

How do you monitor SSD writes on Linux?

Linux tracks every disk operation in /proc/diskstats, updated in real-time. A short script that reads this file twice per second and calculates the difference gives you live write speed and drive utilization. It's free, it's built into every Linux system, and in my case it found over a terabyte per day of unnecessary writes that weeks of code review and a verification document completely missed.

How much does SSD thrashing actually cost?

A $600 enterprise SSD rated for 1,000 TBW costs about $0.60 per terabyte of writes. At 840 GB/day from the overlayfs copy-up alone, plus logs, pip installs, and container lifecycle overhead, I was spending roughly $0.50/day or $180/year in drive wear — burning 30% of the drive's lifespan annually. After optimization, the same daily workload costs about $4/year. The drive will now outlast the machine it's installed in.