This part is also interesting:

> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.

Incidents like this make me understand why the Netflix "Chaos Gorilla" style of operating is so important. As the same post says:

> We build our systems with the assumption that things will occasionally fail

Failure at every level has to be simulated regularly to learn how to handle it, and doing that well is a really hard problem.