> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
It remains amazing to me that, even with all the layers of automation, the root cause of most serious deployment problems remains some variant of a fat-fingered user.
Look at the language used, though. It's saying very loudly, "Look, this isn't the engineer's fault here." It's one thing I miss about Amazon's culture: not blaming people when systems fail.
The follow-up doesn't bullshit with "extra training to make sure no one does this again"; it says, effectively, "we're going to make it impossible for this to happen again, even if someone makes a mistake".
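To make that concrete, the kind of safeguard they're describing is basically a pre-flight check in the capacity-removal tool: validate the blast radius of the command before executing it, so a typo can't take a subsystem below its minimum healthy fleet size. Rough sketch of the idea (the names, thresholds, and structure here are made up for illustration, not AWS's actual tooling):

```python
# Hypothetical sketch of a capacity-removal guardrail -- not AWS's actual tool.
# The idea: check a "remove N servers" request against a per-subsystem floor
# and a per-command cap before anything is actually removed.

MIN_HEALTHY = {                # assumed per-subsystem floor, e.g. from a service registry
    "billing-index": 40,
    "placement": 25,
}
MAX_REMOVAL_FRACTION = 0.05    # never remove more than 5% of a fleet in one command


class RemovalRejected(Exception):
    pass


def plan_removal(subsystem: str, current_fleet: int, requested: int) -> int:
    """Return the number of servers that may safely be removed, or raise."""
    if requested <= 0:
        raise RemovalRejected("nothing to remove")

    floor = MIN_HEALTHY.get(subsystem)
    if floor is None:
        raise RemovalRejected(f"unknown subsystem {subsystem!r}")

    cap = int(current_fleet * MAX_REMOVAL_FRACTION)
    if requested > cap:
        raise RemovalRejected(
            f"requested {requested} servers but the per-command cap is {cap}"
        )

    if current_fleet - requested < floor:
        raise RemovalRejected(
            f"removal would leave {current_fleet - requested} servers, "
            f"below the minimum of {floor} for {subsystem}"
        )

    return requested


if __name__ == "__main__":
    # A fat-fingered "500" instead of "5" gets refused instead of executed.
    try:
        plan_removal("billing-index", current_fleet=1000, requested=500)
    except RemovalRejected as err:
        print("refused:", err)
    print("approved:", plan_removal("billing-index", current_fleet=1000, requested=5))
```

If I remember the postmortem correctly, the actual fix was along these lines: make the tool remove capacity more slowly and refuse to drop any subsystem below its minimum required capacity, rather than retrain the human.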
Agreed, especially regarding the culture, but isn't this pretty much the same explanation they gave a few years ago when something similar happened?
I seem to recall an EC2 or S3 outage a few years ago that boiled down to an engineer pushing out a patch that broke an entire region when it was supposed to be a phased deployment.
I could be misremembering that, but it's important that these lessons be applied across the whole company (or at least across AWS), so it would be a bigger mark against AWS if this outage is the result of tooling similar to what caused a previous one.
Pretty sure that one was a Microsoft Azure outage.
(Source: am a self-identified post-mortems connoisseur. :)
Do you by chance keep a public log of your postmortem collection? :)