When half the internet went offline because a single DNS record in a single AWS region got overwritten at the wrong millisecond, it was a clear reminder for the entire world: even the most advanced, hardened, redundant, planet-scale infrastructure on Earth is still just infrastructure.

It’s not magic. It’s not a guarantee. It’s someone else’s computer, and someone else’s software, with someone else’s race conditions, someone else’s delays, and someone else’s shortcomings.

That’s why this AWS outage shook so many people.

In this episode of SaaS That App: Building B2B Web Applications, Daniel Cannon, Chief Innovation Officer at Delta Systems, sat down with Aaron Marchbanks and Justin Edwards to walk through what happened, why it cascaded, and what software builders should actually take away from it.

Quick caveat: this isn’t an AWS-bashing article. AWS is brilliant. AWS has some of the most talented engineers in the world. But no amount of brilliance prevents the reality that complex systems fail in complex ways.

 

What Actually Broke?

To oversimplify: an internal AWS worker overwrote DNS routing rules for DynamoDB in the US East 1 region. 

DynamoDB isn’t just a database you can choose to use or skip. Internally, inside AWS, DynamoDB is foundational infrastructure: credentials, metadata, service discovery. AWS itself relies on DynamoDB behind the scenes.

When the DNS entries for DynamoDB got corrupted, services in US East 1 couldn’t route to DynamoDB. As a result, AWS services couldn’t route to AWS services. Everything else was just fallout.

This wasn’t global DNS dying. It was more subtle, more hidden, more insidious: an internal race condition. Two workers each believed they had the right to write the new DNS rules. One experienced latency and arrived late, but its permission check had happened before the delay, not after. So when it finally arrived, it wrote stale data and, effectively, nuked routing.

That tiny nuance in ordering alone brought Netflix, Spotify, and half the internet to their knees.
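
To make that check-then-act race concrete, here is a minimal sketch in TypeScript. The names and structure are hypothetical, not AWS’s actual internals; the point is only that validating a write before a delay, instead of at write time, lets stale data win.

    // Hypothetical sketch of a check-then-act race; not AWS's actual code.
    // Each worker wants to publish a DNS "plan"; newer plans should win.
    type DnsPlan = { version: number; records: Record<string, string> };

    let published: DnsPlan = { version: 1, records: { dynamodb: "10.0.0.1" } };

    const sleep = (ms: number) =>
      new Promise<void>((resolve) => setTimeout(resolve, ms));

    // BROKEN: the freshness check happens before the delay, so a worker that
    // hits latency can overwrite a newer plan with an older one.
    async function publishUnsafe(plan: DnsPlan, delayMs: number) {
      const isNewer = plan.version > published.version; // check
      await sleep(delayMs);                             // latency happens here
      if (isNewer) published = plan;                    // act on a stale check
    }

    // SAFER: re-validate at write time (compare-and-swap style), so a plan
    // that arrives late is rejected instead of clobbering routing.
    async function publishSafe(plan: DnsPlan, delayMs: number) {
      await sleep(delayMs);
      if (plan.version > published.version) published = plan;
    }

    async function demo() {
      const olderPlan = { version: 2, records: { dynamodb: "10.0.0.2" } };
      const newerPlan = { version: 3, records: { dynamodb: "10.0.0.3" } };
      // The worker carrying the older plan is delayed; the newer plan lands first.
      await Promise.all([publishUnsafe(olderPlan, 100), publishUnsafe(newerPlan, 10)]);
      console.log(published.version); // 2 -- the older plan won, routing is now wrong
    }

    demo();

Swap publishSafe in for publishUnsafe and the newer plan survives; the entire difference is where the check happens relative to the delay.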

 

The Post-Outage “Zombie Day” Was Worse

Even though AWS fixed the root cause in a couple of hours, the carnage wasn’t over.

DynamoDB is heavily used for batch processing, so job queues piled up. Jobs that would have fired at normal load over a two-hour window all burst back like a firehose the moment things resumed. Services started running at 120 to 150 percent of capacity, trying to clear the backlog.

That’s why even after the root fix, everything still felt down. This is the part most founders and engineering leaders don’t understand:

Downtime has two deaths: the outage itself, and the storm surge afterward. And if your system isn’t architected for both, your time to normalcy can be days, not hours.
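
One way to blunt that second wave is to drain the backlog at a planned rate instead of letting every queued job fire at once. A minimal sketch, assuming an in-memory job list and hypothetical names; a real system would lean on its queue’s own throttling and concurrency limits:

    type Job = { id: number };

    const sleep = (ms: number) =>
      new Promise<void>((resolve) => setTimeout(resolve, ms));

    async function processJob(job: Job): Promise<void> {
      // real work would go here
    }

    // Drain the backlog in fixed-size batches with a pause between them, so
    // recovery runs at a planned rate instead of 150% of normal capacity.
    async function drainBacklog(backlog: Job[], batchSize = 50, pauseMs = 1_000) {
      while (backlog.length > 0) {
        const batch = backlog.splice(0, batchSize); // take at most batchSize jobs
        await Promise.all(batch.map(processJob));   // bounded concurrency per batch
        await sleep(pauseMs);                       // leave headroom for live traffic
      }
    }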

 

Why Most Companies Don’t Architect for This Level of Robustness

Even though multi-region deployments exist, almost nobody does them. It’s not that we don’t know how. It’s that almost nobody is willing to pay for it.

Small businesses don’t need to operate like Facebook. And even mid-market enterprises usually aren’t willing to triple their cloud spend just to guard against a once-every-few-years edge case in a specific AWS region. So every decision about resiliency is a business trade-off.

  • “What happens if we’re down for 2 hours?”
  • “How much money is that worth investing to prevent?”
  • “What is mission-critical vs merely annoying?”

This calculus matters more than most founders think. And if you don’t have this conversation before the outage, you will have it during or after.
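
To make the calculus concrete, here is a toy back-of-the-envelope comparison. Every number is made up for illustration; the point is that the decision can be written down and defended rather than improvised during an incident.

    // Toy numbers, purely illustrative: compare the expected annual cost of
    // outages against the extra spend required for multi-region redundancy.
    const downtimeCostPerHour = 5_000;            // lost revenue + support + churn risk
    const expectedOutageHoursPerYear = 4;         // rough guess for a bad year
    const expectedDowntimeCost = downtimeCostPerHour * expectedOutageHoursPerYear; // 20,000

    const extraMultiRegionSpendPerYear = 120_000; // roughly doubling or tripling the bill

    // If the insurance costs more than the expected loss, staying single-region
    // is a defensible business decision -- as long as it was actually decided.
    console.log(expectedDowntimeCost < extraMultiRegionSpendPerYear); // true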

 

How to Actually Diagnose Before You Fix

One of the biggest takeaways Daniel pushed in the episode: don’t rush to fix when you don’t understand what broke. When big outages happen, the instinct is panic-patching, and that is exactly how people make outages longer. Calm heads collect information first, because if you fix the wrong thing, you just add new bugs while the house is burning.

 

How to Communicate Downtime Without Losing Trust

Most founders try to minimize, blame-shift, or hide. Terrible instinct. Your users don’t care whose fault it was. They care about:

  1. Is my data safe?
  2. Is someone actually fixing this?
  3. When will I be able to work again?

AWS is actually world-class here. Their status messaging and post-event reports do one thing extremely well: they narrate the situation with clarity. You should copy that.

Don’t hide. Communicate early, provide updates, and own your response, even if the cause is upstream.
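
One lightweight way to force that clarity is to structure every update around those three questions. A sketch with illustrative field names and wording, not any real status-page API:

    // Hypothetical shape for an incident update; the wording is illustrative.
    interface StatusUpdate {
      dataSafe: string;     // 1. Is my data safe?
      whoIsOnIt: string;    // 2. Is someone actually fixing this?
      whenCanIWork: string; // 3. When will I be able to work again?
    }

    const update: StatusUpdate = {
      dataSafe: "No data has been lost; writes are queued and will be replayed.",
      whoIsOnIt: "An upstream provider issue is degrading logins; our team is rerouting traffic now.",
      whenCanIWork: "Partial service is expected within the hour; next update in 30 minutes either way.",
    };

    console.log(JSON.stringify(update, null, 2));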

 

Final Thoughts

The cloud isn’t magic. It’s still servers, software, and human-written code with human-written race conditions.

The question is not: “Will this happen again?”

It absolutely will. The question is: “Will you be blindsided again?”

 

Daniel’s Background

Daniel Cannon is the Chief Innovation Officer at Delta Systems and Founder and CEO of Strive DB, bringing a wealth of experience in modern web development frameworks and architectures. His expertise spans full-stack development, with particular depth in Ruby on Rails and modern JavaScript frameworks. Daniel's hands-on experience with both traditional and cutting-edge technologies, combined with his ability to evaluate technical trade-offs in practical business contexts, provides valuable insights for organizations navigating technology decisions.

Listen to the new episode on: