Tags: scaling, covid, butcherbox, infrastructure, zero-downtime

March 2020

Every D2C brand's traffic went vertical overnight. Grocery delivery apps crashed. Subscription services queued customers for hours. Our competitors went dark.

We didn't.

I was seven months into my role at ButcherBox — a subscription meat delivery platform — when the world locked down. The platform I'd been quietly optimizing since July 2019 was about to get the stress test of a lifetime.

Before the Surge

When I joined ButcherBox, I was the entire Operations engineering team. Just me. The platform worked, but it was a monolith built for startup scale. Every feature lived in one codebase. Every deployment was a full release. The kind of architecture that works until it doesn't.

In those first 90 days, I did what I always do when I inherit a codebase: I went hunting for the slow paths. Backend refactoring and React frontend optimization yielded an 87% performance improvement in the first three months. That number sounds aggressive. It was. The low-hanging fruit was everywhere — unindexed queries, N+1 patterns, synchronous calls that should have been async.
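The N+1 pattern is worth spelling out, since it was the most common offender. A minimal sketch of the before/after, with hypothetical `Order`/`Box` shapes standing in for the real schema (this is illustrative, not ButcherBox's actual code):

```typescript
// Hypothetical shapes standing in for real tables.
type Order = { id: number; customerId: number };
type Box = { orderId: number; sku: string };

// In-memory rows standing in for the database.
const boxRows: Box[] = [
  { orderId: 1, sku: "classic" },
  { orderId: 2, sku: "custom" },
  { orderId: 2, sku: "addon-bacon" },
];

// N+1: one query per order — each loop iteration would be
// a separate SELECT ... WHERE order_id = ?
function boxesPerOrderNaive(orders: Order[]): Map<number, Box[]> {
  const result = new Map<number, Box[]>();
  for (const order of orders) {
    result.set(order.id, boxRows.filter((b) => b.orderId === order.id));
  }
  return result;
}

// Batched: one query for all orders (SELECT ... WHERE order_id IN (...)),
// then group the rows in memory.
function boxesPerOrderBatched(orders: Order[]): Map<number, Box[]> {
  const ids = new Set(orders.map((o) => o.id));
  const rows = boxRows.filter((b) => ids.has(b.orderId));
  const result = new Map<number, Box[]>();
  for (const o of orders) result.set(o.id, []);
  for (const row of rows) result.get(row.orderId)!.push(row);
  return result;
}
```

Same result either way; the difference is one round trip to the database instead of one per order.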

Those optimizations weren't heroic. They were hygiene. But they mattered, because when COVID hit, the platform had headroom that didn't exist when I walked in the door.

87% — performance boost, first 90 days
1 → 9 — team growth, engineers hired
3 — squads: Operations, CS, API

The Crisis

March 2020. Order volume didn't just spike — it surged 10x. Then it kept going. 50% month-over-month growth became the new normal. Every system was under stress simultaneously: the order pipeline, the fulfillment backend, the customer support tools, the subscription management layer.

The difference between us and the platforms that went down? The decisions that mattered most had already been made. The 87% performance gains. The monitoring I'd put in place. The patterns I'd started establishing for how the codebase should evolve.

But headroom buys you time. It doesn't buy you a solution. Here's what we did with that time.

Three Things I Shipped in Two Weeks

Tyler Wall shipped three critical optimizations in 14 days during the COVID surge: database query refactoring that added 4x headroom, an async order processing pipeline that cut customer-facing latency by an order of magnitude, and subscription state caching that handled a 95% read-heavy traffic wall. Together, they kept a $2B+ platform online.

The first 14 days of the COVID surge were the most intense engineering sprint of my career. Three optimizations kept us alive.

1. Database Query Refactoring Under Load

The order pipeline had queries that scaled linearly with order volume. At 2x traffic, they were slow. At 10x, they would have been fatal. I rewrote the critical paths to use batch operations and materialized views, turning O(n) queries into O(1) lookups for the hottest paths. This alone bought us 4x headroom on the database tier.
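The O(n)-to-O(1) shift can be sketched as maintaining a materialized summary on the write path instead of rescanning rows on the read path. A minimal in-memory sketch, assuming a hypothetical per-SKU order total (the class and field names are illustrative):

```typescript
type OrderRow = { sku: string; qty: number };

class OrderStats {
  private rows: OrderRow[] = [];
  // "Materialized view": running totals per SKU, kept current on every write.
  private totals = new Map<string, number>();

  insert(row: OrderRow): void {
    this.rows.push(row);
    this.totals.set(row.sku, (this.totals.get(row.sku) ?? 0) + row.qty);
  }

  // O(n) path: scans every order row on each call.
  totalForSkuSlow(sku: string): number {
    return this.rows
      .filter((r) => r.sku === sku)
      .reduce((sum, r) => sum + r.qty, 0);
  }

  // O(1) path: reads the precomputed total.
  totalForSkuFast(sku: string): number {
    return this.totals.get(sku) ?? 0;
  }
}
```

The trade is a little extra work on each write for constant-time reads — exactly the right trade when order volume is the thing growing 10x.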

2. Async Order Processing Pipeline

The subscription renewal flow was synchronous — place order, charge card, confirm inventory, notify fulfillment, send email, return response. At 10x volume, that chain became a bottleneck. I broke it into an async pipeline: charge the card, return success, and process everything else in background workers. Customer-facing latency dropped by an order of magnitude.

3. Caching the Subscription State Machine

ButcherBox's subscription model is complex — customizations, delivery windows, add-ons, pauses, skips. Every page load was recomputing subscription state from scratch. I introduced a caching layer for subscription state that invalidated on writes. Read traffic — which was 95% of requests — hit cache instead of the database. This was the single biggest win for handling the traffic wall.

The best time to optimize a platform is before you need to. The second best time is when everything is on fire and you have no choice.

The Long Game

Surviving the surge was week one. The real work was the next three years.

I built the Operations engineering team from zero to nine engineers, then broke it into three specialized squads: Operations backend, CS Support, and the API migration team. Eventually I led all three teams simultaneously — 12 engineers total — with clear ownership and delivery accountability.

The architectural bet was a monolith-to-microservices migration on Azure and Kubernetes. We ran it in parallel with production traffic, with event-driven architecture letting us peel services off the monolith one at a time without downtime. The migration patterns I established became the company-wide standard.
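The peel-one-service-at-a-time pattern can be sketched as strangler-fig routing: every request defaults to the monolith until its route is explicitly extracted to a new service. The router and handler names below are hypothetical:

```typescript
type Handler = (path: string) => string;

class StranglerRouter {
  // Routes that have been peeled off the monolith, by path prefix.
  private extracted = new Map<string, Handler>();

  constructor(private monolith: Handler) {}

  // Extract one route at a time; the monolith keeps serving the rest.
  extract(prefix: string, service: Handler): void {
    this.extracted.set(prefix, service);
  }

  handle(path: string): string {
    for (const [prefix, service] of this.extracted) {
      if (path.startsWith(prefix)) return service(path);
    }
    return this.monolith(path); // everything else still hits the monolith
  }
}
```

Because extraction is a routing change rather than a rewrite, each service can be validated against live traffic and rolled back independently — which is what makes the migration possible with zero downtime.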

By the time I was promoted to Principal Engineer in 2022, the platform was processing over $2B in cumulative transactions. The architecture had scaled 100x from where it started. Zero scaling incidents through the entire COVID period.

100x — platform scale, over four years
0 — scaling incidents, zero downtime
$2B+ — transactions processed on platform
4 years — duration, July 2019 – Oct 2023

What 100x Growth Taught Me

Four years of scaling a platform through a global crisis changed how I think about engineering.

Optimize before you need to. The 87% performance improvement in my first 90 days wasn't urgent at the time. Three months later, it was the difference between staying online and going dark. Every hour spent on performance hygiene is an insurance policy.

Async everything at the boundary. Synchronous request chains are fine at startup scale. They become a liability the moment traffic gets unpredictable. The subscription pipeline rewrite took two days. It would have taken two weeks if I'd waited until the system was already falling over.

Build the team before the crisis. I was solo for months before COVID. If I'd still been solo when traffic went 10x, the story would be different. Hiring ahead of demand — and structuring teams into squads with clear ownership — meant we could parallelize the response.

Migration under fire is possible, but painful. Running a monolith-to-microservices migration while handling 50% month-over-month growth is not something I'd recommend as a strategy. But event-driven architecture made it survivable. Peel off one service, validate, repeat. The monolith shrinks without anyone noticing.

If I did it again, I'd start the microservices migration six months earlier. The monolith was already showing cracks before COVID. I knew it. I just didn't have the team yet to tackle it. Lesson: the migration you delay is the migration you do under fire.

Tyler Wall's ButcherBox work laid the foundation for his later AI-directed development approach — the same discipline of building ahead of demand applies to AI systems today. For a deeper look at event-driven migration patterns, Martin Fowler's Strangler Fig Application describes the approach we used.

See This Work in Context

The ButcherBox scaling work shows up across several profiles on this portfolio. The platform engineering profile highlights the architecture decisions. The engineering management profile covers the team-building side — growing from solo to three squads. The default profile puts it all in context.

Frequently Asked Questions

How did ButcherBox handle 10x traffic during COVID with zero downtime?

Tyler Wall engineered backend optimizations in the first two weeks of the COVID surge — database query refactoring, caching layer improvements, and async processing for order pipelines. These changes, combined with architecture decisions made months earlier, allowed the platform to absorb 10x traffic with zero scaling incidents while competitors experienced significant downtime.

What does scaling a platform 100x over four years look like?

At ButcherBox, 100x scaling meant growing from a startup-stage monolith to an event-driven microservices architecture on Azure and Kubernetes, processing over $2B in transactions. It required a monolith-to-microservices migration, building an engineering team from scratch, and handling 50% month-over-month order volume spikes — all while keeping the platform running.

What technologies did ButcherBox use for its platform scaling?

The platform migrated from a monolithic architecture to event-driven microservices on Azure and Kubernetes. The stack included Docker, Helm, TypeScript, GitHub Actions for CI/CD, and purpose-built testing frameworks. The architecture patterns Tyler Wall established were adopted company-wide.