It started when we discovered that some of our build processes were unable to run due to failures in Docker.
My first thought was the usual: “Is it us?” I checked Docker’s status and found an incident on their side. Two minutes later, Twitter/X confirmed it wasn’t just us. AWS was having a bad day, and a big chunk of the internet came along for the ride.
This post is a short reflection on that incident. We were lucky: our systems run in a few different regions and were not directly connected to the affected services. But there were many lessons to learn.
What actually went down
On its website, AWS shared that the outage started with a DNS failure in DynamoDB, which triggered a chain reaction across key AWS systems (EC2, Lambda, SQS) and later exposed a deeper problem in AWS’s internal network monitoring. It impacted 141 AWS services and lasted about 15 hours before everything stabilized. The detailed post-event report can be found here. If you want to learn more about how the incident unfolded, Gergely Orosz has a good deep dive in his newsletter, The Pragmatic Engineer.
The incident affected thousands of companies. When AWS went down, it wasn’t just a tech issue; it stopped businesses in their tracks. A few hours offline can mean lost sales, missed deadlines, and frustrated customers. Every major outage reminds teams to take a hard look at how fragile their setup might be.
The incident also highlighted the trade-off we’ve made with the cloud. Using AWS, Azure, or Google Cloud is fast, simple, and cheaper upfront compared to running your own data center. You don’t need to buy servers, manage power, or hire people to maintain racks. But the convenience comes with a cost: less control and higher dependency. When one cloud region fails, you wait for them to fix it. Running your own data center costs more in hardware and staff, yet you can decide how it’s built, tested, and recovered. Most teams choose the cloud because it speeds up development and removes operational overhead, but the AWS outage reminded us that saving time can mean giving up some control.
We’ve also been through several major cloud-provider incidents in the past. I remember one Cloudflare outage in particular: our system was down because our DNS and CDN were tied to Cloudflare’s network, and when Cloudflare sneezed, everyone caught a cold. We realized we had no quick fallback; we just had to wait it out. Similarly, the AWS outage made companies large and small realize that “wait it out” shouldn’t be the only plan.
If your application, backend, databases, and so on are all tied to one provider and one region, that’s a single point of failure. You can run that way, but only at your own risk. Being “production-ready” nowadays means being failure-tolerant by default.
So, what do we do about it? No cloud provider is perfect: AWS, Google, Microsoft, Cloudflare, you name it, they all have bad days. We can’t prevent outages, but we can prepare for them. Below are key strategies and techniques that we learned and implemented in our team to mitigate the impact of outages.
Key Strategies for Resilience
Here are approaches we’ve used and seen others use to stay online and recover quickly when cloud problems happen:
Deploy across multiple Availability Zones and Regions
- At a minimum, deploy your critical workloads in a multi-AZ setup (multiple data centers in one region) so an isolated data center failure won’t take you down. But given regional outages like this one, consider multi-region redundancy for truly critical systems.
- Whether you do active-active (serving traffic from two regions at once) or active-passive (warm standby in a second region), having a Plan B region can turn a regional outage into a minor performance problem rather than a full shutdown.
- Multi-region adds complexity (data replication, syncing configs, etc.) and cost, of course, but it increases resilience. A minimal DNS failover sketch follows this list.
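To make the active-passive idea concrete, here is a minimal sketch of DNS-based failover using Route 53 health checks via boto3. The hosted zone ID, domain, and hostnames are placeholders, and your setup will differ, so treat it as an illustration rather than a drop-in script.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # hypothetical hosted zone
DOMAIN = "app.example.com"           # hypothetical domain

# Health check that probes the primary region's endpoint.
health_check_id = route53.create_health_check(
    CallerReference="primary-region-check-001",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "ResourcePath": "/health",
        "RequestInterval": 30,   # seconds between probes
        "FailureThreshold": 3,   # consecutive failures before failing over
    },
)["HealthCheck"]["Id"]

def upsert_failover_record(set_id, role, target, check_id=None):
    """Create/update a failover record: PRIMARY serves traffic while healthy,
    SECONDARY takes over automatically when the health check fails."""
    record = {
        "Name": DOMAIN,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,            # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if check_id:
        record["HealthCheckId"] = check_id
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

upsert_failover_record("primary", "PRIMARY", "primary.example.com", health_check_id)
upsert_failover_record("secondary", "SECONDARY", "standby.example.com")
```

With records like these, clients are routed to the standby region automatically once the primary’s health check fails, which is the “Plan B region” behaviour described above.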
Design for Graceful Degradation
- Not every part of your application needs to work 100% for the core experience to survive. Identify which features can fail gracefully. Decouple your architecture so that if one service or dependency goes down, it doesn’t drag everything with it.
- In practice, this might mean using queues and async processing. If your database is having issues, perhaps you queue up writes for later, serve cached data, or switch to a read-only mode. The goal is to fail softly: users get a lighter version of your app instead of a blank error screen. We follow this approach in our app at ShopBack so that if our API gateway ever fails to respond for any reason, users can still access cached read-only data; a minimal cache-fallback sketch follows this list.
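As a rough illustration of that cache-fallback pattern, here is a minimal sketch in Python. The endpoint URL and the in-process cache are hypothetical; in a real system the cache would typically be Redis or a CDN layer.

```python
import time
import requests

API_URL = "https://api.example.com/deals"   # hypothetical upstream endpoint
_cache = {}                                  # stand-in for Redis/CDN in a real setup

def get_deals():
    """Return (data, degraded): degraded=True means we served stale cached data."""
    try:
        resp = requests.get(API_URL, timeout=2)   # fail fast instead of hanging
        resp.raise_for_status()
        data = resp.json()
        _cache["deals"] = (data, time.time())     # refresh cache on every success
        return data, False
    except requests.RequestException:
        cached = _cache.get("deals")
        if cached is not None:
            data, fetched_at = cached
            # Serve stale, read-only data instead of a blank error screen.
            return data, True
        raise  # nothing cached yet: surface the error to the caller
```

The caller can use the degraded flag to show a banner (“some data may be out of date”) while the core experience keeps working.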
Regular Backups and Cross-Cloud Backups
- Outages often remind us of the value of backups. Make sure you regularly back up your data and store those backups in a different region or even outside your current Cloud provider.
- For example, you might periodically export critical data from your AWS database to an on-premises server or another cloud’s storage (a cross-region snapshot-copy sketch follows this list). That way, if AWS is down hard, you haven’t lost data, and you could even spin up a read-only instance elsewhere in a pinch. Define your RTO (Recovery Time Objective) and RPO (Recovery Point Objective), that is, how long you can afford to be down and how much data you can afford to lose, and architect the infrastructure and system to meet those targets.
- A case study of this is the cyberattack on one of the biggest stock exchanges in Vietnam in 2024. The company kept all its backups in the same cloud environment, so when attackers gained access, they froze both the main system and the backups, making a quick restore impossible. It took the company a whole week to recover. If those backups had been stored elsewhere, the outcome could have been very different.
- The main downside of cross-cloud backups is cost and the extra effort needed to manage them.
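As one concrete flavour of this, here is a sketch that copies the latest automated RDS snapshot into a second region with boto3. The instance identifier and regions are placeholders; a true cross-cloud backup (e.g., exporting a dump to another provider’s object storage) follows the same idea with that provider’s SDK.

```python
import boto3

SOURCE_REGION = "us-east-1"          # hypothetical primary region
TARGET_REGION = "ap-southeast-1"     # hypothetical backup region
DB_INSTANCE = "orders-db"            # hypothetical RDS instance

source_rds = boto3.client("rds", region_name=SOURCE_REGION)
target_rds = boto3.client("rds", region_name=TARGET_REGION)

# Find the most recent automated snapshot of the instance.
snapshots = source_rds.describe_db_snapshots(
    DBInstanceIdentifier=DB_INSTANCE,
    SnapshotType="automated",
)["DBSnapshots"]
latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

# Copy it into the backup region so a regional outage (or a compromised account
# region) can't take out both the database and its backups at once.
# Encrypted snapshots also need a KmsKeyId valid in the target region.
target_rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier=latest["DBSnapshotArn"],
    TargetDBSnapshotIdentifier=f"{DB_INSTANCE}-dr-{latest['SnapshotCreateTime']:%Y%m%d}",
    SourceRegion=SOURCE_REGION,
)
```

Run something like this on a schedule and prune old copies, since the extra storage is exactly the cost trade-off mentioned above.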
Disaster Recovery Drills
- When an outage hits, every minute counts. You don’t want to be clicking around the console, spinning up resources in a panic. Use Infrastructure as Code and automated scripts for your failover plans. Set up processes that can quickly spin up your systems in another region or on a different provider when something goes wrong.
- Run Disaster Recovery Drills to practice this failover. For example, occasionally simulate “What if us-east-1 dies?” and see if your team can restore service from backups or switch DNS to your secondary site within your RTO window (a minimal drill-timing sketch follows this list). Practice not only finds the gaps in your plan but also makes the real thing much less chaotic.
- Don’t forget to test your communications during such drills: make sure the team can communicate even if your primary systems (or Slack/email) are down.
- At ShopBack, we run this exercise every year and set tight goals for how fast we can rebuild our environment from scratch.
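To show what “practice against your RTO” can look like in code, here is a minimal sketch of a drill runner that times each recovery step and compares the total against a target. The commands are hypothetical placeholders; the point is that the steps are scripted (Infrastructure as Code, restore scripts) rather than ad-hoc console clicks.

```python
import subprocess
import time

RTO_SECONDS = 2 * 60 * 60   # hypothetical target: restore within 2 hours

# Hypothetical recovery steps; swap in your own IaC and restore commands.
STEPS = [
    ("provision standby infra", "terraform apply -auto-approve -var 'region=ap-southeast-1'"),
    ("restore database",        "./scripts/restore_latest_snapshot.sh ap-southeast-1"),
    ("switch DNS to standby",   "./scripts/failover_dns.sh secondary"),
    ("run smoke tests",         "./scripts/smoke_test.sh https://standby.example.com"),
]

def run_drill():
    start = time.monotonic()
    for name, cmd in STEPS:
        step_start = time.monotonic()
        subprocess.run(cmd, shell=True, check=True)   # abort the drill if a step fails
        print(f"{name}: {time.monotonic() - step_start:.0f}s")
    total = time.monotonic() - start
    verdict = "PASS" if total <= RTO_SECONDS else "FAIL"
    print(f"total recovery time: {total:.0f}s (RTO {RTO_SECONDS}s) -> {verdict}")

if __name__ == "__main__":
    run_drill()
```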
Monitoring and Observability of Dependencies
- One sneaky lesson from outages is that sometimes services you didn’t think were critical turn out to be. In this case, many teams were surprised to find that even services running in other AWS regions or seemingly independent systems failed because they secretly depended on something in us-east-1.
- We need to map out the dependencies. What happens if Service A can’t talk to Service B? Do you know all the external APIs, auth services, or resources that your stack relies on? Use monitoring tools to catch these hidden dependencies and set up alerts for any unusual latencies or failures in them. During normal times, this helps you catch issues early; during an outage, it can pinpoint which dependency is the weak link causing cascading failures.
- Also, keep an eye on your external and third-party services. After several incidents involving them, we set up a dedicated Slack alert channel at ShopBack to monitor the health of these integrations separately; a minimal dependency-probe sketch follows this list.
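For illustration, here is a minimal dependency probe that checks a few external endpoints and posts failures to a Slack incoming webhook. The endpoint list and webhook URL are placeholders; a real setup would usually live inside your monitoring stack rather than a standalone script.

```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder webhook

# Hypothetical external dependencies your stack quietly relies on.
DEPENDENCIES = {
    "payment provider": "https://api.payments.example.com/health",
    "auth service":     "https://auth.example.com/health",
    "email API":        "https://api.email.example.com/status",
}

def check_dependencies():
    for name, url in DEPENDENCIES.items():
        try:
            resp = requests.get(url, timeout=3)
            resp.raise_for_status()
        except requests.RequestException as exc:
            alert = f":rotating_light: dependency '{name}' looks unhealthy: {exc}"
            # Post to the dedicated alert channel so failures are visible immediately.
            requests.post(SLACK_WEBHOOK, json={"text": alert}, timeout=5)

if __name__ == "__main__":
    check_dependencies()   # run on a schedule (cron, Lambda, etc.)
```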
Consider Hybrid and Multi-Cloud Approaches
- This is the big guns. Not every company can afford to do it. It’s expensive and requires extra effort and maintenance to manage well. “Multi-cloud” means using more than one cloud provider, while hybrid setups combine cloud services with your own on-premises servers. Both are often seen as ways to reduce reliance on a single provider.
- They’re not trivial to implement, but for some businesses, they can be worth it. The goal is to run some of your systems on another provider or your own servers so that if one fails, the others keep working; a small abstraction-layer sketch follows below.
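One practical building block for this is hiding provider-specific storage behind a small interface so a secondary backend can take over when the primary fails. Here is a sketch using two S3-compatible object stores via boto3; the bucket names and the secondary’s endpoint URL are placeholders, and any other provider’s SDK could sit behind the same interface.

```python
from abc import ABC, abstractmethod

import boto3
from botocore.exceptions import BotoCoreError, ClientError

class BlobStore(ABC):
    """Provider-agnostic object storage interface."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3CompatibleStore(BlobStore):
    """Works for AWS S3 and any S3-compatible store via endpoint_url."""
    def __init__(self, bucket, region, endpoint_url=None):
        self.bucket = bucket
        self.client = boto3.client("s3", region_name=region, endpoint_url=endpoint_url)

    def put(self, key, data):
        self.client.put_object(Bucket=self.bucket, Key=key, Body=data)

    def get(self, key):
        return self.client.get_object(Bucket=self.bucket, Key=key)["Body"].read()

class FailoverStore(BlobStore):
    """Try the primary backend first; fall back to the secondary on failure."""
    def __init__(self, primary, secondary):
        self.backends = [primary, secondary]

    def put(self, key, data):
        for backend in self.backends:
            try:
                return backend.put(key, data)
            except (BotoCoreError, ClientError):
                continue
        raise RuntimeError("all storage backends failed")

    def get(self, key):
        for backend in self.backends:
            try:
                return backend.get(key)
            except (BotoCoreError, ClientError):
                continue
        raise RuntimeError("all storage backends failed")

# Hypothetical wiring: AWS as primary, an S3-compatible store elsewhere as secondary.
store = FailoverStore(
    S3CompatibleStore("assets-primary", region="us-east-1"),
    S3CompatibleStore("assets-backup", region="eu-west-1",
                      endpoint_url="https://objects.other-cloud.example.com"),
)
```

The application only ever talks to the interface, which keeps the multi-cloud complexity contained in one place instead of leaking through the codebase.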
Systems will have bugs, and incidents will happen. We can’t control when AWS or any provider has issues, but we can control our preparedness and response. The latest AWS outage reminded us of that. By learning from it and from our own war stories (yes, Cloudflare, I’m looking at you), we can keep our applications not just running on sunny days, but resilient through the storms.
Stay safe out there in the cloud! And next time AWS goes dark, may your systems stay comfortably in the light.
If you love my blogs, please subscribe for more engineering lessons on building reliable systems.