A few things I have learnt along the way:

Logs are great, but only once you've identified the problem. If you are searching through logs to _find_ a problem, it's far too late.

Processing/streaming logs to get metrics is a terrible waste of time, energy, and money. Spend that effort producing high-quality metrics directly from the apps you are looking after/writing/decomming (example: don't parse access logs to count 4xx/5xx responses and graph them; collate and push those metrics directly from the app).
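
For example, a minimal sketch with the Prometheus Java simpleclient (the class and the `record` hook are made up; any metrics library works the same way):

```java
import io.prometheus.client.Counter;

// One counter family, labelled by status class, incremented as responses
// go out the door -- no access-log scraping required.
public class ResponseMetrics {
    static final Counter RESPONSES = Counter.build()
            .name("http_responses_total")
            .help("HTTP responses by status class.")
            .labelNames("status_class")
            .register();

    // Hypothetical hook: call this wherever your framework exposes
    // the final status code of a response.
    public static void record(int statusCode) {
        RESPONSES.labels((statusCode / 100) + "xx").inc();
    }
}
```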

Raw metrics are pretty useless. They need to be translated into business goals: "service X is producing 3% 5xx errors" vs "% of visitors unable to perform action X".

Alerts must be actionable.

Alert rules must be based on sensible, clear-cut conditions: "service X's response time is breaching its SLA", not "service X's response time is double its average for this time in May".
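
To make the distinction concrete, a toy sketch (the thresholds and readings are invented):

```java
public class AlertRules {
    // Clear cut: fires exactly when the SLA is actually being breached.
    static boolean slaBreached(double p99Millis, double slaMillis) {
        return p99Millis > slaMillis;
    }

    // Fuzzy: depends on a moving seasonal baseline that the on-call
    // can't easily reason about when it pages them.
    static boolean looksAnomalous(double p99Millis, double seasonalAvgMillis) {
        return p99Millis > 2 * seasonalAvgMillis;
    }
}
```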

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming

Yeah nah, but, okay, nah yeah.

Generating metrics in the app is much more intrusive, and requires that you figure out the metrics you need ahead of time. It adds dependencies, sockets, and threads to your app.

Unless you're very careful, it's also easy to end up double-aggregating, computing medians of medians and other meaningless pseudo-statistics - if you're using the Dropwizard Metrics library, for example, you've already lost.

If you output structured log events, where everything is JSON or whatever and there are common schema elements, you can easily pull out the metrics you need, configure new ones on the fly, and retrospectively calculate them if you keep a window of log history.
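
As a sketch of what I mean, assuming Jackson on the classpath and log events with a `status` field (both invented for illustration), deriving a 5xx rate after the fact is a few lines:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Derive a 5xx rate after the fact from structured log lines,
// with no change to the app that wrote them.
public class PostHocMetrics {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        long total = 0, errors = 0;
        try (Stream<String> lines = Files.lines(Path.of("events.jsonl"))) {
            for (String line : (Iterable<String>) lines::iterator) {
                JsonNode event = mapper.readTree(line);
                total++;
                if (event.path("status").asInt() >= 500) errors++;
            }
        }
        System.out.printf("5xx rate: %.2f%%%n", 100.0 * errors / total);
    }
}
```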

When I've worked on systems with both pre- and post-calculated metrics, the post-calculated metrics were vastly more useful.

The huge, virtually showstopping, caveat here is that there is lots of decent, easy-to-use tooling for pre-calculated metrics, and next to nothing for post-calculated metrics. You can drop in some libraries and stand up a couple of servers and have traditional metrics going in a day, with time for a few games of table tennis. You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.

Anyway, if there's a VC reading this with twenty million quid burning a hole in their pocket who isn't fussy about investing in companies with absolutely no path to profitability, let me know, and I'll do a startup to fix all this. I'll even put the metrics on the blockchain for you, guaranteed street cred.

> if you're using the Dropwizard Metrics library, for example, you've already lost.

Can you go into a bit more detail here? Curious to know where Dropwizard goes wrong.

I prefer to use the Prometheus client libraries where possible. Prometheus' data model is "richer" -- metric families and labels, rather than just named metrics. Adapting from Dropwizard to Prometheus is a pain, and never results in the data feeling "native" to Prometheus.
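
Roughly the difference, side by side (the metric names here are invented; the `MetricRegistry` and simpleclient builder calls are the real APIs):

```java
import com.codahale.metrics.MetricRegistry;
import io.prometheus.client.Counter;

public class DataModels {
    // Dropwizard: every combination becomes a separate, flat, named metric.
    static final MetricRegistry DROPWIZARD = new MetricRegistry();

    // Prometheus: one metric family, with the dimensions as labels.
    static final Counter REQUESTS = Counter.build()
            .name("requests_total")
            .help("Requests by method and status.")
            .labelNames("method", "status")
            .register();

    public static void record(String method, int status) {
        DROPWIZARD.counter("requests." + method + "." + status).inc();
        REQUESTS.labels(method, Integer.toString(status)).inc();
    }
}
```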

I think they just mean the host is aggregating, so any further aggregation compounds on top of data that's already been aggregated over time. Like StatsD's default is shipping metrics every 10s, so if you graph it and your graph rolls up those data points into 10-minute data points (cuz you're viewing a week at once), then you're averaging an average. Or averaging a p95. People often miss that this is happening, and it can drastically change the narrative.
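
A toy example of how much the rollup can skew things (numbers invented):

```java
public class RollupSkew {
    public static void main(String[] args) {
        // Two 10s StatsD windows: 2 requests averaging 10ms,
        // then 8 requests averaging 100ms.
        double avgOfAvgs = (10.0 + 100.0) / 2;          // 55ms -- what the rollup shows
        double trueAvg   = (10.0 * 2 + 100.0 * 8) / 10; // 82ms -- what actually happened
        System.out.println(avgOfAvgs + " vs " + trueAvg); // 55.0 vs 82.0
    }
}
```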

Yes, exactly this. It's the fact that you're doing aggregation in two places. Since you're always going to be aggregating on the backend, aggregating in the app is bad news.

It may be interesting to think about the class of aggregate metrics that you can safely aggregate. Totals can be summed. Counts can be summed. Maxima can be maxed. Minima can be minned. Histograms can be summed (but histograms are lossy). A pair of aggregatable metrics can be aggregated pairwise; a pair of a total and a count lets you find an average.

Medians and quantiles, though, can't be combined, and those are what we want most of the time.

Someone who loves functional programming can tell us if metrics in this class are monoids or what.
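
For what it's worth, they are monoids: each of the safe aggregates above has an identity value and an associative merge, which is exactly the monoid structure, and that's what makes them safe to combine in any order or grouping. A sketch with the total-and-count pair:

```java
// A mergeable aggregate is a monoid: an identity value plus an
// associative combine. Total+count is one, and it yields an average.
public final class TotalCount {
    public static final TotalCount IDENTITY = new TotalCount(0.0, 0);

    public final double total;
    public final long count;

    public TotalCount(double total, long count) {
        this.total = total;
        this.count = count;
    }

    // Associative, with IDENTITY as the unit -- safe to merge in any order.
    public TotalCount merge(TotalCount other) {
        return new TotalCount(total + other.total, count + other.count);
    }

    public double mean() {
        return count == 0 ? Double.NaN : total / count;
    }
}
```

Medians and quantiles fail exactly here: there is no associative merge you can define on the median values alone.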

There is an unjustly obscure beast called a t-digest which is a bit like an adaptive histogram; it provides a way to aggregate numbers such that you can extract medians and quantiles, and the aggregates can be combined:

https://github.com/tdunning/t-digest
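
A rough sketch of that merge property, assuming the Java library in that repo (API as I remember it from the README, so treat the details as approximate):

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

public class DigestMerge {
    public static void main(String[] args) {
        // One digest per host (or per 10s window)...
        TDigest hostA = TDigest.createMergingDigest(100);
        TDigest hostB = TDigest.createMergingDigest(100);
        for (int i = 0; i < 1000; i++) {
            hostA.add(Math.random() * 10);   // fast host
            hostB.add(Math.random() * 1000); // slow host
        }

        // ...merged centroid-by-centroid on the backend, then queried
        // for quantiles over the combined population.
        TDigest merged = TDigest.createMergingDigest(100);
        for (Centroid c : hostA.centroids()) merged.add(c.mean(), c.count());
        for (Centroid c : hostB.centroids()) merged.add(c.mean(), c.count());

        System.out.println("p50 = " + merged.quantile(0.5));
        System.out.println("p99 = " + merged.quantile(0.99));
    }
}
```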