An extensive Amazon Web Services cloud outage that began Monday morning illustrated the fragile interdependence of the internet, as major communications, financial, healthcare, education, and government platforms were disrupted around the world. As the day wore on, AWS diagnosed and began working to correct the issue, which originated in the company's critical US-EAST-1 region in Northern Virginia. But the cascade of effects took time to fully resolve.
Researchers reflecting on the incident particularly highlighted the length of the outage, which began on Monday, October 20, at 3 am ET. AWS said in a status update that "all AWS services have returned to normal operations" by 6:01 pm ET Monday. The outage originated in Amazon's DynamoDB database application programming interface and, according to the company, "affected" 141 other AWS services. Multiple network engineers and infrastructure experts stressed to WIRED that failures are understandable and inevitable for so-called hyperscalers such as AWS, Microsoft Azure, and Google Cloud Platform, given their complexity and sheer size. But they note that this reality shouldn't simply excuse cloud providers when downtime stretches on for hours.
"In hindsight, it's easy to figure out what went wrong after the fact, but AWS's overall reliability shows how difficult it is to prevent every failure," said Ira Winkler, chief information security officer at reliability and cybersecurity firm CYE. "Ideally, this will be a lesson learned, and Amazon will implement more redundancies that will prevent such disasters from lingering in the future."
AWS did not respond to WIRED’s questions about the long tail of recoveries for customers. An AWS spokesperson said the company plans to release a “post-event summary” of the incident.
"I don't think this was just a 'stuff happens' outage. I would have expected a full remediation very quickly," said Jake Williams, vice president of research and development at Hunter Strategy. "To give them their due, cascading failures aren't something they have a lot of experience working with, because they don't have outages that often. So that's to their credit. But it's really easy to get into the mindset of giving these companies a pass, and we shouldn't forget that they created this situation by actively courting customers to run more on their infrastructure than those customers control themselves. They benefit financially."
The incident was caused by a known culprit in web outages: Domain Name System (DNS) resolution problems. DNS is essentially the internet's phone book, the mechanism that directs web browsers and other clients to the correct servers. Because nearly every request starts with a DNS lookup, resolution problems are a common source of outages: they cause requests to fail outright and prevent content from loading.
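To make that failure mode concrete, here is a minimal Python sketch of the "phone book" lookup using only the standard library. The hostname is DynamoDB's public us-east-1 endpoint; the error handling is illustrative and is not a representation of how AWS's internal resolvers or clients actually behave.

```python
# Minimal sketch: how a DNS resolution failure surfaces to an application.
# Hostname is DynamoDB's public us-east-1 endpoint; failure handling is
# illustrative only, not AWS's actual client logic.
import socket

def resolve(hostname: str) -> list[str]:
    """Ask the system resolver (DNS) for the IP addresses behind a hostname."""
    try:
        # getaddrinfo performs the DNS lookup; it raises socket.gaierror
        # when the name cannot be resolved, i.e., the phone book lookup fails.
        results = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in results})
    except socket.gaierror as exc:
        # With no IP address, the client cannot even open a connection,
        # so every request to the service fails before it leaves the machine.
        print(f"DNS resolution failed for {hostname}: {exc}")
        return []

if __name__ == "__main__":
    print(resolve("dynamodb.us-east-1.amazonaws.com"))
```

The key point the sketch illustrates: when resolution fails, the error occurs before any traffic reaches the service itself, which is why a DNS problem in one API endpoint can make many otherwise healthy services appear to be down at once.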