The AWS US-EAST-1 October 2025 outage: 15 hours that re-opened the single-cloud debate
A DNS fault in DynamoDB took out 2,500+ services for 15 hours on October 20. The technical story is small; the strategic story is whether anyone actually moves off single-cloud now.
On October 20, 2025, AWS US-EAST-1 (Amazon’s busiest region, in Northern Virginia) experienced a 15-hour cascading outage triggered by a DNS resolution failure for the DynamoDB API endpoint. Over 2,500 internet services went dark or degraded, including Snapchat, Alexa, Coinbase, Duolingo, Fortnite, Slack, Atlassian, Ring, and Robinhood. By rough estimate, that covers 30-40% of the consumer internet’s interactive surface area.
Two days later Fireship published “US-EAST-1 is humanity’s weakest link” with the framing that has become the dominant narrative: “Over 2,500 internet services got wrecked by the most catastrophic cloud outage in history, courtesy of AWS. We’ll break down the technical details and explain how the world’s addiction to cloud computing brought us all to our knees.” The technical story is interesting. The strategic story — whether anyone actually moves off single-cloud topologies after this — is the more important question.
This piece walks through what failed, how Reddit’s devops community processed it in real time, and what the post-mortem actually changed about how serious engineers think about cloud concentration risk.
What technically failed
The official AWS post-incident documentation and the ThousandEyes timeline analysis converge on the same root cause:
- DNS resolution failure for the DynamoDB regional API endpoint in US-EAST-1. According to AWS’s post-event summary, a latent race condition in DynamoDB’s automated DNS management left an empty DNS record for dynamodb.us-east-1.amazonaws.com, so the endpoint stopped resolving.
- Cascading dependencies. A non-trivial portion of AWS’s own internal control plane depends on DynamoDB. When DynamoDB resolution failed, IAM token validation slowed, Lambda invocations queued, ECS task scheduling degraded, and S3 metadata operations timed out. None of these services were technically “down”; they were starved of the dependency they needed to function.
- No clean isolation between data plane and control plane. This is the deeper architectural issue. AWS’s data plane (the services themselves) and control plane (the orchestration of those services) share too many dependencies, in US-EAST-1 specifically, because the region has accumulated nearly two decades of legacy “well, let’s put it in us-east-1 too” decisions.
- Customer workloads with implicit US-EAST-1 dependencies — even multi-region deployments — got hit because cross-region operations (DNS, IAM, Route 53 control plane) often route through US-EAST-1 by default.
Total impact: ~15 hours of degraded service before full recovery. No data loss. No security incident. Just a coordinated failure of dependencies that propagated wider than any single AWS team predicted.
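The cascade described above can be modelled as a walk over a dependency graph. The sketch below is a minimal illustration, not a map of AWS internals: the service names and edges in `DEPENDS_ON` are hypothetical and only loosely follow the list above.

```python
from collections import deque

# Hypothetical dependency map: service -> services it needs in order to function.
# The edges are illustrative and do not describe AWS's real internal topology.
DEPENDS_ON = {
    "dynamodb-dns": [],
    "dynamodb": ["dynamodb-dns"],
    "iam-token-validation": ["dynamodb"],
    "lambda": ["iam-token-validation"],
    "ecs-scheduling": ["dynamodb", "iam-token-validation"],
    "s3-metadata": ["dynamodb"],
    "s3-data-reads": [],
}


def starved_services(depends_on, failed):
    """Return every service that is transitively starved when `failed` breaks."""
    # Invert the map so we can walk from a failure to the services that need it.
    dependents = {name: [] for name in depends_on}
    for service, needs in depends_on.items():
        for need in needs:
            dependents[need].append(service)

    starved = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in starved:
                starved.add(service)
                queue.append(service)
    return starved


if __name__ == "__main__":
    print(sorted(starved_services(DEPENDS_ON, "dynamodb-dns")))
```

In this toy graph, one failed DNS node starves five services, while `s3-data-reads`, which has no path to DynamoDB, is untouched. Running the same walk over your own service map is a quick way to see how wide a single dependency reaches.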
The Reddit reaction in real time
The 785-upvote r/devops thread “Engineers everywhere are exiting panic mode and pretending they weren’t googling ‘how to switch clouds’” captured the mood as the outage cleared:
“Engineers everywhere are exiting panic mode and pretending they weren’t googling ‘how to switch clouds’ at 3am.”
That’s the line. The honest reaction wasn’t “this is the moment we go multi-cloud.” It was “this is the moment we consider going multi-cloud, then go back to bed, then continue paying AWS exactly as before next quarter.” The deeper 213-upvote thread “AWS outage today made us realize how fragile our Dev flow really is” carried the more useful sub-conversation about what specifically broke and what could have prevented it.
Top comments across both threads carried three recurring themes:
Theme 1: Internal services were as vulnerable as customer services.
“Our internal CI/CD couldn’t deploy because GitHub Actions was degraded. Our incident management tool was down because PagerDuty uses AWS. We literally couldn’t communicate the outage to our team because Slack was down. It was like watching dominoes fall in a hall of mirrors.” (paraphrased pattern across multiple comments)
Theme 2: Multi-region within AWS didn’t fully protect you.
“We’re in three regions. Our entire control plane is still us-east-1. We were down. The ‘failover’ we’d designed assumed AWS itself would still be functional in the other regions. It wasn’t, because the IAM tokens we needed to authenticate the failover required us-east-1.”
Theme 3: True multi-cloud is still too expensive and complex for most teams.
“Sure, multi-cloud sounds great in theory. In practice we’d need to rewrite half our deployment scripts, deal with different IAM systems, pay for cross-cloud data egress, and maintain expertise on two clouds. For a 15-hour outage every 2-3 years, the math doesn’t work.”
That last point is the one most working engineers I follow landed on. The Reddit zeitgeist after the outage was less “switch clouds” and more “design for cell isolation within AWS more aggressively, treat us-east-1 as a hot region not a default, and have a real plan for the 15-hour-outage scenario.”
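The “math doesn’t work” argument from the quote can be made concrete with a few lines of arithmetic. This is a back-of-the-envelope sketch: the 15-hour outage and the 2-3 year interval come from the quote, while the hourly outage cost and the yearly multi-cloud overhead are made-up placeholders to replace with your own numbers.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def implied_availability(outage_hours, years_between_outages):
    """Fraction of time the platform is up if one such outage recurs per interval."""
    total_hours = years_between_outages * HOURS_PER_YEAR
    return 1 - outage_hours / total_hours


def expected_annual_outage_cost(outage_hours, years_between_outages, cost_per_hour):
    """Average yearly cost of the outage, spread over the recurrence interval."""
    return outage_hours / years_between_outages * cost_per_hour


def breakeven_cost_per_hour(outage_hours, years_between_outages, annual_overhead):
    """Hourly outage cost above which the extra yearly spend pays for itself."""
    return annual_overhead / (outage_hours / years_between_outages)


if __name__ == "__main__":
    # Hypothetical inputs: $10,000 lost per outage hour, $300,000/year of
    # multi-cloud overhead (egress, tooling, second-cloud expertise).
    print(f"{implied_availability(15, 2.5):.2%}")
    print(expected_annual_outage_cost(15, 2.5, 10_000))
    print(breakeven_cost_per_hour(15, 2.5, 300_000))
```

A 15-hour outage every 2.5 years averages out to 6 hours of downtime a year, or roughly 99.93% availability. Under the placeholder numbers, that is $60,000 a year in expected losses against $300,000 a year in overhead, so multi-cloud only breaks even once an outage hour costs about $50,000. The sketch assumes the second cloud would have fully absorbed the outage, which is generous.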
What Fireship got right and what he glossed over
Fireship’s video framing — “the world’s addiction to cloud computing brought us all to our knees” — captured the popular narrative. It’s compelling. It’s also a bit reductive.
What Fireship got right:
- The technical cascade was real and ugly. A single DNS issue should not take out 2,500 services. The fact that it did points to fundamental tight coupling that AWS has been describing as “loose” for a decade. The video correctly highlights this.
- US-EAST-1 occupies a disproportionate amount of internet surface. By rough estimate, 30-40% of consumer-facing services have at least one critical dependency in us-east-1, often via control plane operations even when the data plane is elsewhere.
- The outage was preventable with better isolation. AWS knows this. They’ve been publishing “cell-based architecture” papers since 2020. The implementation has been slow.
What Fireship glossed over:
- Multi-cloud isn’t a free lunch. The video implies the answer is “diversify.” Multi-cloud has real costs — operational complexity, cross-cloud egress fees, the need for teams that know two clouds well. For most teams, the rational choice is still “single cloud with better cell isolation,” not “two clouds.”
- AWS’s track record is actually pretty good. A 15-hour major outage every 2-3 years is roughly the same reliability as the major alternatives (GCP had a similar incident in 2024; Azure had multiple smaller ones in 2025). The grass on the other side is not greener.
- The “addiction” framing blames customers. It’s not addiction — it’s that AWS is genuinely the best single-vendor cloud for most use cases, and switching costs are real.
The reconciliation: what actually changed after October 20
Two weeks after the outage, the working consensus among engineers I follow:
Things people are actually doing differently:
- Reviewing US-EAST-1 dependencies. Audit your stack for everything that resolves to us-east-1 even when “the app” runs elsewhere. IAM. Route 53 control plane. CloudFront distributions. ACM certs. Even if you’re “in us-west-2,” check.
- Strengthening cross-region failover plans. Most teams discovered their failover plan assumed AWS itself was healthy in other regions. Update the assumption: AWS may be globally degraded for hours. Plan for that.
- Building isolation cells within AWS. Use multiple VPCs, multiple accounts, separated control planes. The cell-based architecture pattern is real and worth implementing — and most teams hadn’t.
- Adding non-AWS dependencies for some critical paths. Cloudflare for DNS, Fastly for CDN, a non-AWS database for the absolute-critical-path data. Not full multi-cloud — selective second-cloud for the things you can’t afford to lose with AWS.
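A first pass at the US-EAST-1 dependency review can be as simple as classifying the hostnames your stack talks to. The sketch below assumes you already have that list (from config files, egress logs, or similar). The set of global endpoints is an assumption for illustration and is incomplete, so check it against current AWS documentation before relying on it.

```python
# Assumption: these global AWS endpoints have their control plane in us-east-1.
# The list is illustrative and incomplete; verify against current AWS docs.
GLOBAL_ENDPOINTS_ASSUMED_US_EAST_1 = {
    "iam.amazonaws.com",
    "route53.amazonaws.com",
    "cloudfront.amazonaws.com",
}


def classify_endpoint(hostname):
    """Label one hostname by how obviously it depends on us-east-1."""
    host = hostname.strip().lower().rstrip(".")
    if host in GLOBAL_ENDPOINTS_ASSUMED_US_EAST_1:
        return "global (assumed us-east-1 control plane)"
    if "us-east-1" in host:
        return "explicit us-east-1"
    return "no obvious us-east-1 dependency"


def audit(endpoints):
    """Group hostnames by classification so the risky ones stand out."""
    report = {}
    for endpoint in endpoints:
        report.setdefault(classify_endpoint(endpoint), []).append(endpoint)
    return report


if __name__ == "__main__":
    # Hypothetical endpoint list for an app that "runs in us-west-2".
    endpoints = [
        "dynamodb.us-west-2.amazonaws.com",
        "sqs.us-east-1.amazonaws.com",
        "iam.amazonaws.com",
        "api.example.com",
    ]
    for label, hosts in audit(endpoints).items():
        print(label, hosts)
```

String matching only catches dependencies that show up in a hostname. Transitive ones, such as a third-party vendor that itself runs in us-east-1, still need a manual review.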
Things people are not actually doing despite saying they would:
- Migrating to multi-cloud for the main application.
- Switching primary cloud away from AWS.
- Building a true active-active multi-cloud deployment.
- Significantly reducing AWS spend.
The October 20 outage was a wake-up call about cell isolation within AWS, not about cloud concentration risk overall. That’s the realistic settled state.
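For readers who have not met the cell pattern, the core idea is that each tenant is pinned to exactly one isolated cell, so a failure in one cell reaches only that cell’s tenants. The sketch below is one simple way to express that, assuming hash-based assignment. The cell names are hypothetical, and each cell is imagined as its own account and VPC.

```python
import hashlib


def assign_cell(tenant_id, cells):
    """Deterministically map a tenant to exactly one cell."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def blast_radius(tenants, cells, failed_cell):
    """Return the tenants affected when a single cell fails."""
    return [t for t in tenants if assign_cell(t, cells) == failed_cell]


if __name__ == "__main__":
    # Hypothetical cells, each imagined as a separate AWS account and VPC.
    cells = ["cell-a", "cell-b", "cell-c", "cell-d"]
    tenants = [f"tenant-{n}" for n in range(1000)]
    affected = blast_radius(tenants, cells, "cell-b")
    print(f"{len(affected)} of {len(tenants)} tenants affected")
```

Plain modulo hashing reshuffles tenants whenever the number of cells changes, so a production router would more likely keep an explicit tenant-to-cell mapping. The isolation only holds if the cells genuinely share no control-plane dependency, which is the part October 20 exposed.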
What this says about cloud strategy in late 2025
The bigger pattern: single-cloud-first remains the rational default for most teams, but the case for “second cloud for critical-path operations” got measurably stronger. Not the full multi-cloud religion of 2017. The narrower pattern of “Cloudflare for the edge, AWS for everything else” — or “Hetzner for compute, AWS for managed services” — landed as the working compromise.
This connects directly to the cheap-VPS landscape we wrote up in October. The cheap-VPS resurgence in 2025 wasn’t just about cost. It was about diversifying the dependency on AWS. Engineers building production workloads on Hetzner CCX22 + Neon Postgres + Cloudflare are explicitly making bets that aren’t tied to AWS’s availability. October 20 validated those bets.
The longer arc: AWS is too big to fail and too embedded to leave, but the embedding pattern is shifting from “AWS-first” to “AWS-mostly with strategic second-cloud diversification for resilience-critical paths.” That’s a meaningful shift in cloud strategy, even if the dollar amounts moving away from AWS are small in aggregate.
What I’d actually change in your stack right now
If you’re running a team that ships software on AWS US-EAST-1 in late 2025, here is the post-October-20 audit:
- Check your DNS resilience. If your auth flow, your CDN, your customer-facing domain all resolve through Route 53 in us-east-1, you have a single point of failure. Cloudflare DNS or NS1 as a backup costs ~$20-100/month and removes a real concentration risk.
- Review your IAM cross-region behavior. Some IAM operations are inherently us-east-1-dependent. Know which.
- Pre-stage your incident communication outside AWS. If your status page, your customer email tool, and your team Slack are all on AWS, your communication breaks when AWS does. Have a backup channel.
- Test the 15-hour outage scenario in tabletop. Not just “AWS is down for 30 minutes.” A 15-hour outage means your runbook needs to handle staffing, customer comms, data backfill, and incremental recovery. Most teams have never thought through this duration.
- Don’t migrate clouds. Seriously. The grass isn’t greener; the cost of switching is real; the better lever is cell isolation within AWS.
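The DNS item in the audit above is easy to check mechanically: look at the nameservers your domain delegates to and see whether they all belong to one provider. The sketch below assumes you have already collected the NS hostnames (for example from `dig NS yourdomain`). The suffix patterns cover only the three providers named in this piece and are meant as a starting point.

```python
def provider_for(nameserver):
    """Guess the managed DNS provider from a nameserver hostname."""
    host = nameserver.strip().lower().rstrip(".")
    if ".awsdns-" in host:  # e.g. ns-123.awsdns-45.com
        return "Route 53"
    if host.endswith(".ns.cloudflare.com"):
        return "Cloudflare"
    if host.endswith(".nsone.net"):
        return "NS1"
    return "unknown"


def dns_providers(nameservers):
    """Return the set of providers behind a domain's NS records."""
    return {provider_for(ns) for ns in nameservers}


def is_single_provider(nameservers):
    """True when every nameserver belongs to the same provider."""
    return len(dns_providers(nameservers)) == 1


if __name__ == "__main__":
    # Hypothetical NS records for a domain hosted only on Route 53.
    nameservers = [
        "ns-123.awsdns-45.com.",
        "ns-678.awsdns-90.net.",
    ]
    print(dns_providers(nameservers), is_single_provider(nameservers))
```

A domain that reports a single provider is not necessarily broken, but it does mean one provider’s bad day takes your domain with it. Adding a second provider also means keeping the two zones in sync, which is its own operational cost.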
The honest summary: October 20 changed the calculus on cell isolation more than it changed the calculus on cloud choice. That’s a slightly less exciting takeaway than “the cloud is broken, go multi-cloud now.” It’s also the one that’s actually true.
Sources
Every reference behind this piece. If we make a claim, it's because at least one of these said so — or we lived it ourselves.
- YouTube: Fireship, "US-EAST-1 is humanity's weakest link…"
- YouTube: Theo - t3.gg, "Should we break up AWS over the us-east-1 outage?"
- YouTube: ByteMonk, "How a Tiny Bug Crashed AWS | DynamoDB us-east-1 Outage Explained"
- Docs: ThousandEyes, "AWS Outage Analysis: October 20, 2025"
- Blog: r/devops, "Engineers everywhere are exiting panic mode and pretending they weren't googling how to switch clouds" (785 ups)
- Blog: r/devops, "AWS outage today made us realize how fragile our Dev flow really is" (213 ups)
- Firsthand: Two production deployments in different AZ topologies during and after the October 20 outage window