A huge difference between Monarch and other TSDBs that isn't outlined in this overview is that one of the value types in its schema is a histogram, stored as a first-class primitive. Most (maybe all besides Circonus) TSDBs try to create histograms at query time from counter primitives.

All of those query-time histogram aggregations make pretty subtle trade-offs that can make analysis fraught.

In my experience, Monarch storing histograms and being unable to rebucket on the fly is a big problem. A percentile line drawn from a histogram can be incredibly misleading, because it has to estimate, say, the p50 from a set of pre-aggregated buckets. You'll see monitoring artifacts like large jumps and artificial plateaus as a result of how requests fall into buckets, and the bucketer on the default RPC latency metric might not be well tuned for your service. I've seen countless experienced oncallers tripped up by this, because "my graphs are lying to me" is not their first thought.
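A minimal sketch of where those artifacts come from, assuming the usual approach of linearly interpolating inside the bucket that contains the target rank (the bucket bounds and counts below are hypothetical, not Monarch's actual defaults):

    # Estimate a quantile from pre-aggregated bucket counts, the way most
    # query-time percentile functions do: find the bucket holding the
    # target rank, then linearly interpolate within it.
    def estimate_quantile(q, counts, upper_bounds):
        total = sum(counts)
        rank = q * total
        seen, lower = 0.0, 0.0
        for upper, c in zip(upper_bounds, counts):
            if c > 0 and seen + c >= rank:
                # Assumes values are uniform within the bucket -- this
                # assumption is exactly what produces plateaus and jumps.
                return lower + (upper - lower) * (rank - seen) / c
            seen += c
            lower = upper
        return upper_bounds[-1]

    bounds = [5, 10, 25, 50, 100, 250, 500, 1000]  # ms, hypothetical
    # Every request actually takes ~60ms, but they all land in (50, 100],
    # so the "p50" reads 75ms and stays pinned there until traffic moves
    # into a different bucket, at which point the line jumps.
    print(estimate_quantile(0.5, [0, 0, 0, 0, 1000, 0, 0, 0], bounds))  # 75.0

The estimate can only ever land inside one bucket's bounds, so however the real distribution moves within a bucket, the graph shows a plateau; when it crosses a boundary, the graph jumps.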

Circonus Histograms solve that by using a universal bucketing scheme. Details are explained in this paper: https://arxiv.org/abs/2001.06561

Disclaimer: I am a co-author.

Wow, this is a fantastic solution to some questions I've had rattling around in my head for years about how to choose bucket boundaries to minimize error when you can only afford a limited set of buckets.

Do I read right that circllhist has a pretty large, non-configurable set of bins (except that they're sparse, so it may still be small on disk)?

I've found myself using high-cardinality Prometheus metrics where I can only afford 10-15 distinct histogram buckets. So I end up:

(1) plugging my live system data from normal operations and from outage periods into various numeric algorithms that propose optimal bucket boundaries. These algorithms tell me that I could get great accuracy if I chose thousands of buckets, which, thanks for rubbing it in about my space problems :(. Then I write some more code to collapse those into 15 buckets while minimizing error at the places I care about (like p50, p95, p99, p999 under normal and under irregular operations); there's a sketch of this after the list.

(2) making sure I have an explicit bucket boundary at any target that represents a business objective (if my service promises that no more than 1% of requests will take >2500ms, setting a bucket boundary at 2500ms gives me perfectly precise information about whether p99 falls above or below 2500ms)

(3) forgetting to tune this and leaving a bunch of bad defaults in place, which often leads to people saying "well, our graph shows a big spike up to 10000ms, but that's just because we forgot to tune our histogram bucket boundaries before the outage; we actually have to refer to logs to see the timeouts at 50 seconds"
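Here's the sketch mentioned in (1), with the business-objective boundary from (2) folded in. It's a toy illustration under my own assumptions (synthetic lognormal traffic, a greedy "keep the edges that pin down the target percentiles" heuristic), not a real library:

    import bisect, random

    def edge_at_quantile(q, edges, counts):
        """Return the fine-grained upper edge at or above the q-quantile."""
        rank = q * sum(counts)
        seen = 0
        for edge, c in zip(edges, counts):
            seen += c
            if seen >= rank:
                return edge
        return edges[-1]

    def collapse(edges, profiles, targets, must_keep, budget=15):
        """Keep <= budget edges: the ones pinning down each target quantile
        under each traffic profile, plus mandatory SLO boundaries."""
        keep = set(must_keep) | {edges[-1]}
        for counts in profiles:
            for q in targets:
                keep.add(edge_at_quantile(q, edges, counts))
        # If over budget, a real implementation would merge the nearest
        # edges to minimize interpolation error; truncation is a placeholder.
        return sorted(keep)[:budget]

    def observe(samples, edges):
        counts = [0] * len(edges)
        for v in samples:
            counts[min(bisect.bisect_left(edges, v), len(edges) - 1)] += 1
        return counts

    random.seed(0)
    fine = [float(e) for e in range(10, 60010, 10)]  # 10ms..60s, 10ms steps
    normal = observe((random.lognormvariate(4.0, 0.5) for _ in range(10000)), fine)
    outage = observe((random.lognormvariate(8.0, 1.0) for _ in range(10000)), fine)

    print(collapse(fine, [normal, outage],
                   targets=[0.5, 0.95, 0.99, 0.999],
                   must_keep=[2500.0]))  # 2500ms = the SLO boundary from (2)

With two traffic profiles and four target percentiles, that's at most ten edges, comfortably inside a 15-bucket budget.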

I've used these log-linear histograms in a few pieces of code. There is some configurability in the abstract: you could choose a different logarithm base.

In practice, none of the implementations seem to provide that. Within each set of buckets for a given log base, you get reasonable precision at that magnitude. If your metric is oscillating around 1e6, you shouldn't care much about the variance at 1e2, and with this scheme you don't have to tune anything to get that.
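A quick sketch of that property, assuming a log-linear layout with 90 linear slots per power of the base (the numbers are illustrative, not any particular implementation's):

    import math

    def log_linear_bucket(v, base=10, slots=90):
        """Map v > 0 to [lo, hi): 'slots' linear buckets per power of 'base'."""
        exp = math.floor(math.log(v, base))
        width = (base ** (exp + 1) - base ** exp) / slots
        idx = math.floor((v - base ** exp) / width)
        lo = base ** exp + idx * width
        return lo, lo + width

    print(log_linear_bucket(1_234_567))  # (1200000.0, 1300000.0): 1e5 wide
    print(log_linear_bucket(123.4567))   # (120.0, 130.0): 10 wide

Absolute bucket width grows with magnitude (1e5 near 1e6, 10 near 1e2), but the relative width is about 8% in both cases, so you get proportionate precision at every magnitude without tuning anything.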

There are a large number of subtle trade-offs in this histogram space around the bucketing scheme (log vs. log-linear, choice of base), the memory layout (sparse, dense, chunked), and the amount of configurability (circllhist, DDSketch, HDRHistogram, ...). A good overview is this discussion:

https://github.com/open-telemetry/opentelemetry-specificatio...

As for the circllhist: there are no knobs to turn. It uses base 10 and two decimal digits of precision. In the last 8 years I have not seen a single use case in the operational domain where this was not appropriate.
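For concreteness, here is a sketch of what "base 10, two decimal digits" means for the bucket layout; treat the mapping below as my reading of the scheme, and see the paper above for the real definition:

    import math

    def bucket_of(v):
        """Bucket for v > 0: [m * 10^e, (m+1) * 10^e) with 10 <= m <= 99,
        i.e. keep two significant decimal digits. (Float edge cases at
        exact powers of 10 are ignored in this sketch.)"""
        e = math.floor(math.log10(v)) - 1
        m = int(v / 10 ** e)
        return m, e

    print(bucket_of(0.004321))  # (43, -4): bucket [0.0043, 0.0044)
    print(bucket_of(123456.0))  # (12, 4):  bucket [120000, 130000)

Since every bucket spans [m, m+1) x 10^e with m >= 10, reporting a bucket midpoint is off by at most 1/(2m) <= 5% at any magnitude, and because the layout is fixed, histograms from different sources always merge without rebucketing.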