Maybe it would be a good idea for certificates to expire slowly and randomly over 24 or 48 hours. In other words, if the cert has an expiry date of 12:00 UTC, Dec 6th 2018, then start to randomly fail connections at that time with low probability. The probability increases progressively during the next 24 hours until 100% of connections fail at 12:00 UTC, Dec 7th 2018. It's not like the cert is 100% trustworthy one minute and 100% untrustworthy the next minute. Having the failure rate ramp up slowly would give advance warning before everything has gone completely pear-shaped.
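A minimal sketch of what that ramp could look like, assuming a 24-hour window and a simple linear ramp (the function name and window length are just placeholders, not anything a TLS stack actually does today):

```python
import random
from datetime import datetime, timedelta, timezone

RAMP_WINDOW = timedelta(hours=24)  # time from nominal expiry until 100% of connections fail

def should_reject(cert_not_after: datetime, now: datetime | None = None) -> bool:
    """Decide whether to fail this connection, given the cert's nominal expiry.

    Before expiry: never reject. After expiry + RAMP_WINDOW: always reject.
    In between: reject with a probability that rises linearly from 0 to 1.
    """
    now = now or datetime.now(timezone.utc)
    if now < cert_not_after:
        return False
    elapsed = now - cert_not_after
    if elapsed >= RAMP_WINDOW:
        return True
    failure_probability = elapsed / RAMP_WINDOW  # timedelta division yields a float in [0, 1)
    return random.random() < failure_probability

# Example: a cert that nominally expired at 12:00 UTC, Dec 6th 2018
expiry = datetime(2018, 12, 6, 12, 0, tzinfo=timezone.utc)
print(should_reject(expiry, now=expiry + timedelta(hours=6)))   # ~25% chance of True
print(should_reject(expiry, now=expiry + timedelta(hours=25)))  # always True
```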

In the case of Ericsson, this might have allowed an emergency certificate update before the O2 systems lost the ability to update automatically. Once your network is completely down, bringing it back up remotely is hard.

I think this could also be a good idea for phasing out public APIs -- instead of just taking an API offline on the cutoff date, start failing requests early with low probability, ramping the probability up to 100% over the course of a month or so.
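The same ramp applied to an API sunset might look something like this sketch, assuming a plain Python handler, a hypothetical one-month window, and a made-up error payload:

```python
import random
from datetime import datetime, timedelta, timezone
from functools import wraps

SUNSET_START = datetime(2019, 1, 1, tzinfo=timezone.utc)  # hypothetical deprecation start
SUNSET_WINDOW = timedelta(days=30)                         # full shutdown a month later

def sunsetting(handler):
    """Fail an increasing fraction of requests as the shutdown date approaches."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        progress = (datetime.now(timezone.utc) - SUNSET_START) / SUNSET_WINDOW
        if progress > 0 and random.random() < min(progress, 1.0):
            # Hypothetical error shape; a real API would document this response.
            return {"status": 410, "error": "This API is being retired, migrate to v2"}
        return handler(*args, **kwargs)
    return wrapper

@sunsetting
def get_user(user_id):
    return {"status": 200, "user_id": user_id}
```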

How long would it take before developers started wrapping their API and CDN calls in rapid-firing loops because of this?

It already exists and has borrowed the name "resilience engineering" from the construction and engineering fields. Netflix has some interesting blog posts on how they deal with transient faults and resilience in general, implementing concepts like circuit breakers.
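For anyone unfamiliar with the term, here is a toy circuit breaker in Python, just to illustrate the idea that libraries like the ones mentioned below implement far more thoroughly (names and thresholds are my own, not from any particular library):

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: after `max_failures` consecutive failures, stop calling
    the downstream service for `reset_timeout` seconds and fail fast instead."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success resets the failure count
        return result
```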

Have a search for libraries in your favorite language; I'm sure something already exists. I've personally used Polly in .NET.

https://github.com/App-vNext/Polly