I'm going to have to respectfully disagree with a big chunk of this article. Documentation is generally a waste of time unless you have a very static infrastructure, and run books are the devil.
You should never use a run book -- instead you should spend the time you were going to spend writing a run book writing code to execute the steps automatically. This will reduce human error and make things faster and more repeatable. Even better, have the person who wrote the service also write the automation that fixes it, so the automation stays up to date with changes in the code.
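To make that concrete, here's a minimal sketch (my own illustration, not anything Netflix-specific) of a runbook step like "restart the service and verify it's healthy" expressed as code instead of prose. The restart and health-check actions are passed in as callables, since the real commands depend on your stack (e.g. a systemctl call and an HTTP health endpoint):

```python
import time

def restart_and_verify(restart, check_health, retries=5, delay=0):
    """Run the restart action, then poll until the service reports healthy.

    `restart` and `check_health` are stand-ins for whatever your stack
    provides (e.g. a systemctl invocation and an HTTP health check).
    Returns True if the service came back healthy within `retries` polls.
    """
    restart()
    for _ in range(retries):
        if delay:
            time.sleep(delay)
        if check_health():
            return True
    return False
```

The point is that the "runbook" is now executable: it runs the same way every time, and it can live next to the code it repairs.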
At Netflix we tried to avoid spending a lot of time on documentation because by the time the document was done, it was out of date. Almost inevitably any time you needed the documentation, it no longer applied.
I wish the author had spent more time talking about incident reviews. Those were our key to success. After every incident, you review it with everyone involved, including the developers, and then come up with an action plan that at a minimum prevents the same problem from happening again, and even better, prevents an entire class of problems from happening again. Then you have to follow through and make sure the changes actually get implemented.
I agree with the author on the point about culture. That was absolutely critical. You need a culture that isn't about placing blame but finding solutions. One where people feel comfortable, and even eager, to come out and say "It was my fault, here's the problem, and here's how I'm going to fix it!"
'Documentation is generally a waste of time unless you have a very static infrastructure'
I definitely agree with that, and it's partly a corollary of 'documentation is expensive and requires costly maintenance'.
Run books/checklists are mostly implemented really really badly.
Automation is the ideal, but is costly, and itself requires maintenance.
Also, most of the steps we had to perform did not lend themselves to automation.
> Automation is the ideal, but is costly, and itself requires maintenance.
I would contend that the cost of automation is about the same as the cost of documentation plus the cost of having to manually do the work over and over. It's just a cost borne up front instead of over time. But to your point in the article, you have to have a culture that supports bearing that up-front cost.
> Most of the steps we had to perform did not lend themselves to automation, also.
I don't understand how that's possible. Could you give an example of a task that can't be automated?
> I don't understand how that's possible. Could you give an example of a task that can't be automated?
"Check the logs for service X (they're here ) and look for anything related to the issue"
"If the user impact is high, write an update to the status page detailing the impact and an estimated time to recovery"
The value of a runbook is that it can make use of human intelligence in its steps. No-one is arguing that you shouldn't be automating things like "if the CPU usage is > 90%, spin up another instance and configure it".
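The "CPU > 90%, spin up another instance" case is the kind of mechanical rule that's trivially codeable. A hypothetical sketch, with the monitoring and provisioning calls passed in as placeholders for whatever APIs you actually use:

```python
CPU_THRESHOLD = 90.0  # percent; the trigger level from the example above

def scale_if_needed(get_cpu_usage, launch_instance):
    """Spin up another instance when CPU usage crosses the threshold.

    `get_cpu_usage` and `launch_instance` stand in for your real
    monitoring and provisioning APIs. Returns True if scaling happened.
    """
    if get_cpu_usage() > CPU_THRESHOLD:
        launch_instance()
        return True
    return False
```

No human intelligence is consumed here, which is exactly the dividing line this thread is drawing.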
> "Check the logs for service X (they're here ) and look for anything related to the issue"
I have a long missive about how logs are useless and shouldn't be kept, but that's for another time. I'll summarize by saying that if you have to look at logs, then your monitoring has failed you.
> "If the user impact is high, write an update to the status page detailing the impact and an estimated time to recovery"
I guess technically that would be a step in a runbook, that's fair. Although in my case that was left to PR to do based on updates to the trouble tickets. :)
> The value of a runbook is that it can make use of human intelligence in its steps
I'd rather human intelligence be spent on triage by reading the results of automated diagnosis and coding up remediation software than on repeating steps in a checklist.
Sure, there are uses for checklists of things to check, but even those should be automated through the ticket system at the very least. At that point I no longer consider it a runbook, though I guess some might.
> I have a long missive about how logs are useless and shouldn't be kept, but that's for another time. I'll summarize by saying that if you have to look at logs, then your monitoring has failed you.
How does that work?
First we have to make the distinction between logs and metrics. Logs are unstructured or loosely structured text, whereas metrics are discrete items that can be put into a time series database.
If you emit metrics as necessary to a time series database, then you should be able to build alerting based on the time series metrics. Your monitoring systems should be good at building alerts based on a stream of metrics and visualizing the time series data.
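As a toy illustration of alerting on a metric stream (not any particular monitoring product; a threshold on a moving average is just one common rule shape):

```python
from collections import deque

class ThresholdAlert:
    """Fire when the average of the last `window` samples exceeds `threshold`.

    A real monitoring system evaluates rules like this against a time
    series database; this is just the rule itself, in miniature.
    """
    def __init__(self, threshold, window=5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value):
        self.samples.append(value)
        avg = sum(self.samples) / len(self.samples)
        return avg > self.threshold
```

The key property is that the rule runs continuously against the metric stream, rather than a human grepping logs after the fact.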
Sometimes you might have to look at the visualizations to find something, but ideally you then set up an alert on the thing you looked at so you have the alert for the next time it happens. A great monitoring system lets you turn graphs into alerts right in the interface, so if you're looking at a useful graph you can make an alert out of it.
Sometimes logs can be useful, but only after your monitoring system has told you which system is misbehaving; then you can turn on logs for that system until you've solved the problem. You shouldn't need access to old logs, because if the problem was only in the past, then it's not really a problem anymore, right? If you have an ongoing problem, then maybe keep logs on for that service while you're investigating it, but then turn them off again.
But a ton of logs constantly being generated and stored tends to be fairly useless in practice when you have a good time series database at hand.
Logs have a much, much, much lower barrier to entry than a fully-complete time-series monitoring system that covers everything.
Likewise, turning logs on only after you've seen a problem means you miss out on troubleshooting the root cause of it - if there was a spike of badness this morning but you don't have logs for it, you're missing out on diagnostic information that may have protected you from repeats of that spike in future.
I've also had business guys want to analyse things like access logs in ways that they didn't know previously. Logs provide a datastore of historical activity, which in smaller shops is a cheap data lake.
Perhaps the 'no logs' thing works for your setup, but I think it's bad general advice. And your position is not that logs are useless ("turn on logs for that system until you've solved the problem"), but that retaining logs is useless - quite a significant difference between the two.
> Logs have a much, much, much lower barrier to entry than a fully-complete time-series monitoring system that covers everything.
A monitoring system has a lower barrier to entry.
http://datadoghq.com/ => will do ALL of that and much more. You can deploy it in a few hours to thousands of hosts, no problem.
Direct competitor: http://signalfx.com/
Have no money for a high-quality tool? graphite + statsd will do the trick for basic infrastructure. However, it's single-host, doesn't scale, and only supports basic, ugly graphs.
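For what it's worth, statsd's wire format is simple enough that emitting a metric by hand is a few lines: plain text over UDP, `name:value|type` where the type is `c` for counters, `g` for gauges, `ms` for timers. This sketch assumes a statsd daemon listening on the default UDP port 8125:

```python
import socket

def format_metric(name, value, metric_type="c"):
    # statsd wire format, e.g. "web.requests:1|c" or "cpu.load:0.7|g"
    return f"{name}:{value}|{metric_type}"

def send_metric(name, value, metric_type="c", host="localhost", port=8125):
    # UDP is fire-and-forget: no connection, no error if nothing is listening
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, metric_type).encode(), (host, port))
    finally:
        sock.close()
```

That low barrier to instrumenting code is a big part of why the statsd + Graphite combo stuck around despite its limitations.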
That's what Grafana [1] is for -- i.e. creating nicer displays for Graphite.
> However it's single host, doesn't scale
It may take some effort, but it can be done, and much of the heavy-lifting seems to have been done and been made available as open-source.
Here's a blog post from Jan. 2017 [2] from a gambling site about scaling Graphite.
And here's a talk [3] from Vladimir Smirnov at Booking.com from Feb. 2017 about scaling Graphite -- their solution is open-source (links in the talk and slides available at the link):
This is our story of the challenges we’ve faced at Booking.com and how we made our Graphite system handle millions of metrics per second.
(And this [4] is an older, but more comprehensive, look at various approaches to scaling Graphite from the Wikimedia people with the pros and cons listed).
[1] https://github.com/grafana/grafana
[2] http://engineering.skybettingandgaming.com/2017/01/13/graphi...
[3] https://fosdem.org/2017/schedule/event/graphite_at_scale/