At a high level, the issue stemmed from two programs competing to write the same DNS entry – essentially a record in the internet’s phonebook – at the same time, which resulted in an empty entry. That threw multiple AWS services into disarray.
That “empty page” brought down AWS’ DynamoDB database, creating a cascading effect that impacted other AWS services like EC2, which offers virtual servers for developing and deploying apps, and Network Load Balancer, which distributes incoming traffic across servers. When DynamoDB came back online, EC2 tried to bring all of its servers back online at once and couldn’t keep up.
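To see why restarting everything at once can backfire, consider a toy model in Python. It is only an illustration, not a description of how EC2 works internally: the function `recover`, its parameters and the numbers used are all hypothetical. The model assumes a control system that can start a fixed number of servers per time step, and that any request beyond that limit is rejected and has to be retried.

```python
def recover(server_count, capacity_per_tick, batch_size):
    """Toy recovery model (hypothetical, not AWS's actual mechanism).

    Each tick, up to `batch_size` pending servers ask to start. Only
    `capacity_per_tick` of those requests can be served; the rest are
    rejected and remain pending for a later tick.
    Returns (ticks_used, rejected_requests).
    """
    if capacity_per_tick <= 0 or batch_size <= 0:
        raise ValueError("capacity_per_tick and batch_size must be positive")

    pending = server_count
    ticks = 0
    rejected = 0
    while pending > 0:
        ticks += 1
        requested = min(pending, batch_size)
        accepted = min(requested, capacity_per_tick)
        rejected += requested - accepted  # wasted work that must be retried
        pending -= accepted
    return ticks, rejected


# Everything asks to restart at once: many requests are rejected and retried.
all_at_once = recover(server_count=100, capacity_per_tick=10, batch_size=100)

# Requests are paced to match capacity: nothing is rejected.
paced = recover(server_count=100, capacity_per_tick=10, batch_size=10)
```

In this simplified model both approaches finish in the same number of steps, but the all-at-once run generates hundreds of rejected requests along the way. In a real system that extra load can itself slow recovery, which is why operators often pace restarts.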
Amazon is making a number of changes to its systems following the outage, including fixing the “race condition scenario” that caused the two systems to overwrite each other’s work in the first place, and adding a new test suite for its EC2 service.
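For readers curious what a race condition of this kind looks like in practice, here is a minimal Python sketch. It does not represent Amazon's code or its fix: the `DnsRecord` class, both write functions and the IP addresses are hypothetical. The first simulation shows two writers overwriting one another until the entry ends up empty; the second shows one common safeguard, a version check (often called compare-and-set), that rejects a write based on an outdated view.

```python
from dataclasses import dataclass, field


@dataclass
class DnsRecord:
    """Hypothetical stand-in for one DNS entry; not an AWS data structure."""
    addresses: list = field(default_factory=list)
    version: int = 0


def unsafe_write(record, addresses):
    """Last writer wins: overwrite the entry without checking anything."""
    record.addresses = list(addresses)
    record.version += 1


def guarded_write(record, addresses, expected_version):
    """Compare-and-set: write only if the entry is unchanged since it was read."""
    if record.version != expected_version:
        return False  # stale view; the caller should re-read before retrying
    record.addresses = list(addresses)
    record.version += 1
    return True


def simulate_unsafe():
    record = DnsRecord(addresses=["10.0.0.1"])
    # Writer A publishes a fresh set of addresses.
    unsafe_write(record, ["10.0.0.2", "10.0.0.3"])
    # Writer B, acting on an outdated view, clears the entry.
    unsafe_write(record, [])
    return record.addresses


def simulate_guarded():
    record = DnsRecord(addresses=["10.0.0.1"])
    # Both writers read the entry at the same version before acting.
    seen_by_a = record.version
    seen_by_b = record.version
    a_ok = guarded_write(record, ["10.0.0.2", "10.0.0.3"], seen_by_a)
    b_ok = guarded_write(record, [], seen_by_b)  # rejected: version has moved on
    return record.addresses, a_ok, b_ok
```

Calling `simulate_unsafe()` returns an empty list, the “empty page” of the analogy, while `simulate_guarded()` keeps the first writer's addresses and reports that the second write was refused.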