After the initial disruption, Amazon took four hours to get all the systems back up and running.
Now we know how it happened.
On Tuesday morning, members of the S3 team were debugging the billing system.
The network can normally handle the loss of numerous servers. However, the large number that went offline Tuesday led to a cascading failure across a large portion of AWS's network.
The mistyped command did not cause the outage directly. It took the wrong S3 servers offline, and recovering from that required a full restart.
The subsystems that went down were critical. One of them was the index subsystem, which manages metadata and location information for all S3 objects in the US-EAST-1 region. Without it, services that depend on it could not perform basic data retrieval and storage tasks. The employee had intended to take a few servers offline to work on the billing issue, but a typo in the command took a much larger set of servers offline instead.
Amazon's S3 cloud storage service, along with the many popular websites and apps that run on it, was hobbled Tuesday, leading to consternation across the internet.
Amazon said S3 was designed to withstand the loss of a few servers.
An outage in the company's Simple Storage Service, or Amazon S3, hampered its clients' operations for more than three and a half hours. The company is also declaring war on typos.
"We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level", the company says. The Service Health Dashboard did display a message showing that S3 was having issues, but for a time all other services appeared healthy even though they were not.
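The safeguard Amazon describes, removing capacity slowly and refusing any removal that would take a subsystem below its minimum, can be sketched roughly as follows. This is an illustration only, not Amazon's tool: the names Subsystem, CapacityError, remove_capacity and max_per_step, and the server counts in the usage note, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    """Hypothetical model of one subsystem's server capacity."""
    name: str
    active_servers: int
    min_required: int  # assumed minimum required capacity level


class CapacityError(Exception):
    """Raised when a removal would breach the minimum capacity level."""


def remove_capacity(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> list[int]:
    """Remove `requested` servers in small batches and return the batch sizes.

    The whole request is rejected up front if it would take the
    subsystem below its minimum required capacity.
    """
    if requested < 0 or max_per_step < 1:
        raise ValueError("requested must be >= 0 and max_per_step >= 1")

    # Safeguard: a mistyped, oversized request fails before anything is removed.
    if subsystem.active_servers - requested < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    # Slow removal: never take more than max_per_step servers at once.
    batches = []
    remaining = requested
    while remaining > 0:
        batch = min(max_per_step, remaining)
        subsystem.active_servers -= batch
        remaining -= batch
        batches.append(batch)
    return batches
```

With a made-up index subsystem of 10 servers and a minimum of 6, a request to remove 3 servers would proceed in batches of 2 and 1, while a follow-up request to remove 5 more would be rejected with no servers taken offline.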
To a certain extent, AWS was lucky that the outage affected only its Northern Virginia region, and the company says it has now put protective measures in place to prevent a similar incident from happening again. During the outage, S3 was unable to service requests from clients that depended on the servers that went down, Ars Technica reports.
Amazon concluded its explanation with this message: "Finally, we want to apologize for the impact this event caused for our customers".