Amazon has blamed the outage of its web services, which took down many websites, services and devices… on an engineer’s typo. The recent failure of a critical part of Amazon Web Services known as S3 (Simple Storage Service) also knocked out websites such as Business Insider and Medium. In addition, many people found they could not even turn on their internet-connected lighting, because the intermediate service was also unavailable.
The company investigated the problem and found that at the time of the outage one of its engineers was trying to work out why the system’s billing service was running slowly, and set out to take a small number of servers for a subsystem involved in billing offline for inspection. To do so, the engineer executed a command from Amazon’s “established playbook”. Unfortunately for everyone, one of the inputs to the command was entered incorrectly, and the command removed a larger set of servers than intended, including those supporting two other subsystems, one of which managed the metadata and location information of all S3 objects in the region.
It should be noted that Amazon’s web services are built with redundancy in mind, which allows some elements to fail without taking down the whole system. But a human error that accidentally took the wrong servers offline apparently triggered a cascade of larger failures.
The problem was aggravated by the fact that the company had not fully restarted the subsystems involved for a long time. Amazon admitted that it had not completely restarted the index subsystem or the placement subsystem in its larger regions for many years. It explained that its web services had grown massively over the last few years, and that restarting these subsystems and running the checks required to validate the integrity of the metadata therefore took much longer than expected.
Only Amazon’s Northern Virginia region suffered an outage due to this issue, but it caused major problems for websites and services using the data center located in that region. The tech giant apologized to all customers and said it had put safeguards in place to prevent such problems caused by human error in the future.
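
The article does not describe those safeguards in detail. As a purely illustrative sketch, and not Amazon's actual tooling, one common type of guard refuses any removal request that is unusually large or that would push a subsystem below a minimum server count. Every name here (`Subsystem`, `remove_capacity`, `min_required`, `max_per_request`) is hypothetical.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request is rejected by a safety check."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor needed to keep serving traffic


def remove_capacity(subsystem: Subsystem, count: int, max_per_request: int = 5) -> int:
    """Take `count` servers offline, or raise without changing anything.

    Returns the number of servers still active after the removal.
    """
    if count <= 0:
        raise ValueError("count must be a positive number of servers")

    # Guard 1: cap the size of a single request, so a mistyped input
    # cannot remove a large block of servers in one step.
    if count > max_per_request:
        raise CapacityGuardError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"limit is {max_per_request} per request"
        )

    # Guard 2: never drop below the subsystem's minimum capacity.
    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"{remaining} would remain, minimum is {subsystem.min_required}"
        )

    subsystem.active_servers = remaining
    return remaining
```

With a guard of this kind, a call such as `remove_capacity(billing, 3)` on a well-provisioned subsystem would go through, while a mistyped `remove_capacity(billing, 300)` would be rejected before any server was touched.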