What does HackerNews think of t-digest?

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means

Language: Java

Sounds like you’re looking for T-digests[1] - most production systems I’ve worked on that do this are using them under the hood.

1: https://github.com/tdunning/t-digest

> I get a lot of benefit from this information but somehow it feels shallow.

I take a longer view of this. For example, a few years ago I read about an algorithm to calculate percentiles in real time. [0]

It literally just came up at work today. I've used that information maybe twice since I read it, but it was super relevant today and saved my team potentially weeks of development.

So maybe it's not so shallow.

But to your actual question, I have a similar problem. The best I can say is that deadlines help. I usually put down HN and YouTube when I have a deadline coming up. And not just at work. I make sure my hobbies have deadlines too.

I tell people when I think something will be done, so they start bugging me about it when it doesn't get done, so that I have a "deadline". Also one of my hobbies is pixel light shows for holidays, which come with excellent natural deadlines -- it has to be done by the holiday or it's useless.

So either find an "accountability buddy" who will hold you to your self-imposed deadlines, or find a hobby with natural deadlines, like certain calendar dates, or annual conventions or contests that you need to be done by.

[0] https://github.com/tdunning/t-digest

I am enamored of data structures in the sketch/summary/probabilistic family: t-digest[1], q-digest[2], count-min sketch[3], matrix-sketch[4], graph-sketch[5][6], Misra-Gries sketch[7], top-k/spacesaving sketch[8], &c.

What I like about them is that they give me a set of engineering tradeoffs that I typically don't have access to: accuracy-speed[9] or accuracy-space. There have been too many times that I've had to say, "I wish I could do this, but it would take too much time/space to compute." Most of these problems still work even if the accuracy is not 100%. Furthermore, many (if not all) of these can tune accuracy by parameter adjustment anyway. They also tend to have favorable algebraic properties, i.e., they form monoids or semigroups under merging (there's a merge sketch after the footnotes). In short, they give me the ability to solve problems I couldn't solve before.

I hope they are as useful or intriguing to you as they are to me.

1. https://github.com/tdunning/t-digest

2. https://pdsa.readthedocs.io/en/latest/rank/qdigest.html

3. https://florian.github.io/count-min-sketch/

4. https://www.cs.yale.edu/homes/el327/papers/simpleMatrixSketc...

5. https://www.juanlopes.net/poly18/poly18-juan-lopes.pdf

6. https://courses.engr.illinois.edu/cs498abd/fa2020/slides/20-...

7. https://people.csail.mit.edu/rrw/6.045-2017/encalgs-mg.pdf

8. https://www.sciencedirect.com/science/article/abs/pii/S00200...

9. These may be better described as error-speed and error-space, but I've avoided the term "error" because, for programming audiences, it typically evokes logic errors, whereas what I mean is statistical error.
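
To make the merge property concrete, here's a minimal Java sketch using the t-digest library from [1]. The API names (createMergingDigest, centroids, add, quantile) are assumed from that repo, so check the exact signatures in your version; re-inserting centroids as weighted points is just one way to merge, and the library has its own merge support as well.

    import com.tdunning.math.stats.Centroid;
    import com.tdunning.math.stats.TDigest;

    public class MergeDemo {
        public static void main(String[] args) {
            // Two digests built independently, e.g. on two shards of a dataset.
            TDigest shardA = TDigest.createMergingDigest(100);
            TDigest shardB = TDigest.createMergingDigest(100);
            java.util.Random rng = new java.util.Random(42);
            for (int i = 0; i < 100_000; i++) {
                shardA.add(rng.nextGaussian());
                shardB.add(rng.nextGaussian() + 1.0); // a shifted distribution
            }

            // The semigroup-style merge: fold each shard's centroids
            // (weighted means) into a fresh digest.
            TDigest merged = TDigest.createMergingDigest(100);
            for (Centroid c : shardA.centroids()) merged.add(c.mean(), c.count());
            for (Centroid c : shardB.centroids()) merged.add(c.mean(), c.count());

            // Quantiles of the union, without revisiting the raw data.
            System.out.printf("p50=%.3f p99=%.3f%n",
                    merged.quantile(0.5), merged.quantile(0.99));
        }
    }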

On sketches, there is a genre of structure for estimating histogram-like statistics (median, 99th percentile, etc.) in fixed space, which I really like. Two examples (a usage sketch follows the links):

t-digest https://github.com/tdunning/t-digest

DDSketch https://github.com/DataDog/sketches-java
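
As a rough illustration of the fixed-space property, here's a sketch using the t-digest Java library linked above (API names assumed from that repo; verify against your version). The digest's size is bounded by the compression parameter, not by how many values stream through it:

    import com.tdunning.math.stats.TDigest;

    public class StreamQuantiles {
        public static void main(String[] args) {
            // Space is governed by the compression parameter (here 100),
            // not by the million values that pass through.
            TDigest digest = TDigest.createMergingDigest(100);
            for (int i = 0; i < 1_000_000; i++) {
                digest.add(Math.random());
            }
            System.out.println("median ~ " + digest.quantile(0.5));
            System.out.println("p99    ~ " + digest.quantile(0.99));
        }
    }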

Ah, I misunderstood what you meant. If you are reporting static buckets, I get how that is better than what folks typically do, but how do you know the buckets a priori? Others back their histograms with things like https://github.com/tdunning/t-digest. It is pretty powerful, as the buckets adapt to the data and histograms can be added together.
There are some newer data structures that take this to the next level, such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%); a rough sketch of tail queries follows the links.

[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest
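
A sketch of what tail queries look like with the reference Java implementation from [1] (method names assumed from that repo; verify against your version). quantile() can be asked for points arbitrarily deep in the tail, and cdf() gives the inverse view:

    import com.tdunning.math.stats.TDigest;

    public class TailDemo {
        public static void main(String[] args) {
            // Higher compression -> more centroids -> tighter tail accuracy.
            TDigest digest = TDigest.createMergingDigest(200);
            java.util.Random rng = new java.util.Random(1);
            for (int i = 0; i < 5_000_000; i++) {
                digest.add(Math.exp(rng.nextGaussian())); // heavy right tail
            }
            double p99999 = digest.quantile(0.99999);
            System.out.println("p99.999 ~ " + p99999);
            // cdf() asks the inverse: what fraction of values fall below x?
            System.out.println("cdf(p99.999) ~ " + digest.cdf(p99999));
        }
    }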

Great article! One of my favorite data structures lately is the t-digest: https://github.com/tdunning/t-digest. It would be great to see how it compares, and what additional accuracy we get by storing some of the data.
Yes, exactly this. It's the fact that you're doing aggregation in two places. Since you're always going to be aggregating on the backend, aggregating in the app is bad news.

It may be interesting to think about the class of aggregate metrics that you can safely aggregate. Totals can be summed. Counts can be summed. Maxima can be maxed. Minima can be minned. Histograms can be summed (but histograms are lossy). A pair of aggregatable metrics can be aggregated pairwise; a pair of a total and a count lets you find an average.

Medians and quantiles, though, can't be combined (the median of two per-batch medians is not, in general, the overall median), and those are what we want most of the time.

Someone who loves functional programming can tell us if metrics in this class are monoids or what.
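
For what it's worth, the safely aggregatable metrics above do look like monoids: there's an identity (the empty metric) and an associative merge. A minimal Java sketch of the total-and-count pair (names here are illustrative, not from any library):

    // A (sum, count) pair is a monoid under pairwise addition; the
    // average is derived at read time, so it composes safely.
    record SumCount(double sum, long count) {
        static final SumCount EMPTY = new SumCount(0, 0);   // identity element

        SumCount merge(SumCount other) {                    // associative op
            return new SumCount(sum + other.sum, count + other.count);
        }

        double mean() {
            return count == 0 ? Double.NaN : sum / count;
        }
        // e.g. SumCount.EMPTY.merge(new SumCount(10, 2)).mean() == 5.0
    }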

There is an unjustly obscure beast called a t-digest, which is a bit like an adaptive histogram; it provides a way to aggregate numbers such that you can extract medians and quantiles, and the aggregates can be combined:

https://github.com/tdunning/t-digest

> which was too expensive for Uber to run (since the dataset is in petabytes).

OK, but I am fairly confident Netflix is also at that kind of scale.

Netflix has a section on Atlas's documentation about how they get around this: https://github.com/Netflix/atlas/wiki/Overview#cost

They also did this nice video that outlines their entire operation including how they do rollups: https://www.youtube.com/watch?v=4RG2DUK03_0

This is how they do the rollup while keeping the tails accurate to parts per million and the middle to parts per hundred: https://github.com/tdunning/t-digest

I agree with the premise, but it seems that there are more solutions out there. As other commenters noted, you can collect histograms or HdrHistograms. Those have the problem of needing to be preconfigured, and of not being mergeable unless they are configured the same way.

Instead you can use the t-digest (https://github.com/tdunning/t-digest), a very cool online quantile estimation data structure from Ted Dunning (which he has recently improved with the merging approach). There are a number of implementations out there, and it is not unreasonable to serialize them and merge them; a sketch of that follows below. Unfortunately there's no easy way to set this up in Prometheus, but making that easy could be a fun project.
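
Here's roughly what serialize-and-merge could look like with the reference Java implementation. The byteSize/asBytes/fromBytes names are assumptions based on my recollection of that repo, so treat this as a sketch and check the exact API in your version:

    import java.nio.ByteBuffer;
    import com.tdunning.math.stats.MergingDigest;
    import com.tdunning.math.stats.TDigest;

    public class ShipDigest {
        public static void main(String[] args) {
            TDigest local = TDigest.createMergingDigest(100);
            for (int i = 0; i < 10_000; i++) local.add(Math.random());

            // Serialize for shipping to an aggregator.
            ByteBuffer buf = ByteBuffer.allocate(local.byteSize());
            local.asBytes(buf);
            buf.flip();

            // On the other side: deserialize and keep aggregating.
            TDigest restored = MergingDigest.fromBytes(buf);
            restored.add(0.5); // digests stay updatable after a round trip
            System.out.println("p99 ~ " + restored.quantile(0.99));
        }
    }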