Looks cool. What are some applications of a cardinality estimator?
Memory bound alternative for the COUNT(DISTINCT ...) aggregate function in SQL.
ClickHouse[1] has a few functions for that, with different tradeoffs in memory usage, precision, and performance:
- uniqExact - the same as COUNT(DISTINCT ...) - the exact version, using a hash table;
- uniqCombined - a combination of a linear array of small size in memory arena, a hash table, and a HyperLogLog - this data structure is similar to HyperLogLog++; the log2 of the number of cells is controllable by the user;
- uniq - a cardinality estimator based on adaptive sampling by hash, lower quality to memory usage compared to HyperLogLog, but beats it in CPU efficiency;
- uniqTheta - uses the Theta sketch algorithm, whatever it means;
- uniqUpTo(N) - my favorite, uses a linear array to give the precise answer up to N, and always answers N + 1 when the number of distinct elements is larger than N;
- groupBitmap - an exact calculation using Roaring Bitmaps, makes sense for dense integer sets;
[1] https://github.com/ClickHouse/ClickHouse/What is often forgotten in designing a data structure for a cardinality estimator - is that it should work well not only for a few large states but also for a large number of small sets.
For example, in a query like follows:
SELECT URL, COUNT(DISTINCT UserID) FROM pageviews GROUP BY URL
You should expect that there are a large number of URLs, but most of them will get just 1..2 unique UserIDs. Obviously, it's not practical to allocate even a 4 KB HyperLogLog for every resulting record. That's how complex combined data structures are justified. They should start with ~16..64 bytes of automatic memory in arena or stack, and only then upgrade to a HyperLogLog.