Amazon mystery solved: A typo took down a big chunk of the Internet

Elizabeth Weise

USA TODAY

SAN FRANCISCO - The major outage that hit tens of thousands of websites using Amazon's AWS cloud computing service on Tuesday turned out to be the result of a simple typo: just one incorrectly entered command.

The four-hour outage at Amazon Web Services' S3 system, a giant provider of backend services for close to 150,000 websites, caused disruptions, slowdowns and failure-to-load errors across the United States.


Amazon's Simple Storage Service (S3) lets companies use the cloud to store files, photos, video and other information they serve up on their website. It contains literally trillions of these items, known as "objects" to programmers.

When the system was down, websites could not access the photos, logos, lists or data they normally would have pulled from the cloud. While most of the sites didn't go down, many had broken links and were only partly functional.

On Thursday Amazon published a public letter outlining what happened.

Here's the rundown:

On Tuesday morning, an Amazon team was investigating a problem that was slowing down the S3 billing system.

At 9:37 am Pacific time, one of the team members executed a command that was meant to take a few of the S3 servers offline.

"Unfortunately," Amazon said in its posting, one part of that command was entered incorrectly — i.e. it had a typo.

That mistake caused a larger number of servers to be taken offline than they'd wanted. Two of those servers ran some important systems for the whole East Coast region, such as the ones that let all those trillions of files be placed into customers' websites.

To recover, both systems required a full restart, which takes a lot longer than simply rebooting your laptop.

The outage didn't just affect Amazon's S3 customers; it also hit other Amazon cloud customers, because it turns out those services use S3, too.

While Amazon says it designed its system to work even if big parts failed, it also acknowledged that it hadn't actually done a full restart on the main subsystems that went offline "for many years."

During that time, the S3 system had gotten a whole lot bigger, so restarting it, and doing all the safety checks to make sure its files hadn't gotten corrupted in the process, took much longer than expected.

It wasn't until 1:54 pm Pacific time, four hours and 17 minutes after the mistyped command was first entered, that the entire system was back up and running.

To make sure the problem doesn't happen again, Amazon has rewritten its software tools so its engineers can't make the same mistake, and it's doing safety checks elsewhere in the system.

Amazon apologized to its customers for the event, saying it "will do everything we can to learn from this event and use it to improve our availability even further."

