Amazon’s services are “recovering” as Snapchat and banks are among the sites affected by the outage

Liv McMahon, technology reporter, and Lily Jamali, North America technology correspondent

A woman climbs the stairs in front of a giant AWS sign (Getty Images)

Amazon Web Services (AWS) said late Monday that it has fixed a massive outage that took some of the world’s largest websites offline for much of the day.

More than 1,000 apps and websites – including social media platforms such as Snapchat and banks such as Lloyds and Halifax – were affected by problems that Amazon said originated in the cloud giant’s US operations.

Platform outage monitor Downdetector said user reports of problems worldwide rose to more than 11 million during Monday’s outage.

Even after Amazon fixed the underlying problem, experts said the outage demonstrated the dangers of having so many companies rely on a single dominant supplier.

“What this episode has highlighted is how interdependent our infrastructure is,” said Prof Alan Woodward of the University of Surrey.

“So many online services rely on third parties for their physical infrastructure, and this shows that even the largest of these third-party providers can experience problems.

“Small errors, often human-caused, can have a widespread and significant impact.”

The problems appear to have started at around 07:00 BST on Monday, when users began reporting difficulty accessing multiple platforms.

These included a wide range of sites and services, from massive online games like Fortnite to the language learning app Duolingo.

Earlier in the day, Downdetector told the BBC it had seen more than four million reports from users across 500 sites in just a few hours – more than double the number it would see on a normal weekday.

Reports later peaked at more than 11 million, it said, as more services, including Reddit and Lloyds Bank, tried to recover.

At around 23:00 BST, Amazon said all AWS services had “returned to normal operations”.

But not before the company had to throttle parts of its own system to deal with the underlying problem.

A new series of “cascading failures” may have occurred after the initial outage, according to Mike Chapple, a professor of information technology at the University of Notre Dame.

“It’s like when you have a major power outage. Crews start working to try and get it back on,” Mr Chapple said. “The power may flicker a few times,” he explained, adding that Amazon may initially have been “only addressing the symptoms” and not the cause.

What went wrong?

Amazon has yet to fully disclose what caused Monday’s outage or issue an official statement about it.

In an update on its service status webpage, it said the issue “appears to be related to DNS resolution on the DynamoDB API endpoint in US-EAST-1.”

DNS, which stands for Domain Name System, is often likened to a telephone directory for the internet.

It effectively translates the names of the websites that people use (such as bbc.co.uk) into numbers that can be read and understood by computers.

This process is at the heart of how we use the internet, and interruptions to it can leave web browsers unable to find the content they are looking for.
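For readers who want to see that translation step in practice, the short Python sketch below asks the operating system's resolver for the numeric addresses behind a name. It is an illustration only: the function name `resolve` is our own, and it says nothing about how Amazon's internal DNS systems work.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the IP addresses a hostname maps to, or [] if the lookup fails."""
    try:
        # getaddrinfo asks the system resolver, which consults DNS when needed.
        results = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Raised when the name cannot be resolved, which is roughly what
        # software experiences when DNS resolution is disrupted.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as text.
    return sorted({sockaddr[0] for *_, sockaddr in results})
```

Calling `resolve("bbc.co.uk")` on a connected machine returns a list of numeric addresses. When the lookup fails, the function returns an empty list, and a program in that position has no address to connect to even if the service behind the name is running.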

Matthew Prince, CEO of Cloudflare, told the BBC that the AWS outage highlights the power cloud services have over the way the internet works.

“Everybody has a bad day, Amazon had a bad day today,” he said.

“There are amazing things about the cloud, it allows you to scale … but if you have an outage like this, it can take down a lot of services that we rely on.”

And Cori Crider, head of the Future of Technology Institute, told the BBC it was “a bit like a collapsing bridge”.

“A substantial part of the economy has fallen to pieces,” she said.

And with so much of cloud computing relying on Amazon, Microsoft and Google – estimated at around 70% – she said the status quo was “unsustainable”.

“Once you have concentrated supply in a handful of monopoly suppliers, when something like this fails, it takes a huge percentage of the economy with it,” she said.

“We should really try to buy more local services instead of relying on a handful of American monopoly platforms.”

“This is a risk to our security, our sovereignty and our economy, and we need to look at the structural divisions to make our markets more resilient to this kind of shock.”

One computer science expert says part of the responsibility lies with the companies that use AWS.

“Companies using Amazon haven’t taken enough care to build security systems into their applications,” said Ken Birman, a professor of computer science at Cornell University in New York.

Outages like Monday’s happen frequently, though not always on this scale.

Birman told the BBC that app developers should make sure to invest in backups for mission-critical apps that live in the cloud.

“We know how to make these systems stronger, and we know how to do it securely,” Birman said.
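One common way developers build in that kind of protection is failover: if the primary copy of a service cannot be reached, the application tries a backup hosted elsewhere. The sketch below is a minimal illustration of the idea, not a description of any specific recommendation made in this article. The names `call_with_failover`, `primary` and `backup` are hypothetical stand-ins.

```python
from typing import Callable, Sequence


def call_with_failover(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")


def primary() -> str:
    # Hypothetical stand-in for a call to a service in the affected region.
    raise ConnectionError("primary region unavailable")


def backup() -> str:
    # Hypothetical stand-in for the same service hosted somewhere else.
    return "served from backup"
```

With these stand-ins, `call_with_failover([primary, backup])` skips the failing primary and returns the backup's response. In practice the harder work is keeping the backup's data up to date, which this sketch does not attempt.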

The issue of liability may end up in court.

More than a year after CrowdStrike’s massive outage, Delta Air Lines is still fighting the company to recover more than $500 million in losses.

Even after CrowdStrike fixed the problem, the airline said it had to manually reset 40,000 servers, causing major flight delays for several days.

Additional reporting by Esyllt Carr.
