Netflix actually added the additional AZs because of a prior outage that did take them down.
"After a 2012 storm-related power outage at Amazon during which Netflix suffered through three hours of downtime, a Netflix engineer noted that the company had begun to work with Amazon to eliminate “single points of failure that cause region-wide outages.” They understood it was the company’s responsibility to ensure Netflix was available to entertain their customers no matter what. It would not suffice to blame their cloud provider when someone could not relax and watch a movie at the end of a long day."
https://www.networkworld.com/article/3178076/why-netflix-did...
We went multi-region as a result of the 2012 inc. source: I now manage the team responsible for performing regional evacuations (shifting traffic and scaling the savior regions).
That sounds fascinating! How often does your team have to leap into action?
We don’t usually discuss the frequency of unplanned failovers, but I will tell you that we do a planned failover at least every two weeks. The team also uses traffic shaping to perform whole system load tests with production traffic, which happens quarterly.
Do you do any chaos testing? Seems like it would slot right in, there.