Looks cool. What are some applications of a cardinality estimator?

A memory-bounded alternative to the COUNT(DISTINCT ...) aggregate function in SQL.

ClickHouse[1] has a few functions for that, with different tradeoffs in memory usage, precision, and performance:

    - uniqExact - the same as COUNT(DISTINCT ...) - the exact version, using a hash table;
    - uniqCombined - a combination of a small linear array in a memory arena, a hash table, and a HyperLogLog; this data structure is similar to HyperLogLog++; the log2 of the number of cells is controllable by the user;
    - uniq - a cardinality estimator based on adaptive sampling by hash; lower quality per unit of memory than HyperLogLog, but beats it in CPU efficiency;
    - uniqTheta - uses the Theta sketch algorithm, whatever that means;
    - uniqUpTo(N) - my favorite, uses a linear array to give the precise answer up to N, and always answers N + 1 when the number of distinct elements is larger than N;
    - groupBitmap - an exact calculation using Roaring Bitmaps; makes sense for dense integer sets.

[1] https://github.com/ClickHouse/ClickHouse/
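To make the uniqUpTo(N) idea concrete, here is a minimal sketch in Python (not ClickHouse's actual implementation, which stores fixed-size hashes in a flat array): keep distinct values in a small linear array, and once more than N are seen, discard the array and report N + 1 forever after.

```python
# Sketch of uniqUpTo(N): exact answer up to N distinct values,
# N + 1 as soon as the true cardinality exceeds N.
class UniqUpTo:
    def __init__(self, n):
        self.n = n
        self.values = []          # linear array, scanned on every insert
        self.overflowed = False

    def add(self, value):
        if self.overflowed:
            return                # answer is already fixed at n + 1
        if value not in self.values:
            self.values.append(value)
            if len(self.values) > self.n:
                self.overflowed = True
                self.values = []  # free the memory; it's no longer needed

    def result(self):
        return self.n + 1 if self.overflowed else len(self.values)
```

The linear scan on insert is O(N) per element, which is perfectly fine because N is small by design; that's also why the memory usage is strictly bounded.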

What is often forgotten when designing a data structure for a cardinality estimator is that it should work well not only for a few large sets but also for a large number of small sets.

For example, in a query like the following:

    SELECT URL, COUNT(DISTINCT UserID) FROM pageviews GROUP BY URL

You should expect a large number of URLs, but most of them will get just 1..2 unique UserIDs. Obviously, it's not practical to allocate even a 4 KB HyperLogLog for every resulting record. That's what justifies complex combined data structures: they should start with ~16..64 bytes of automatic memory in an arena or on the stack, and only then upgrade to a HyperLogLog.
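The upgrade mechanics can be sketched as follows. This is a hypothetical illustration, not ClickHouse's code: a tiny linear array of hashes serves as the initial state, and a plain Python set stands in for the HyperLogLog after the upgrade, since the point here is the two-tier design, not the estimator itself.

```python
import hashlib

SMALL_LIMIT = 16  # stay within ~16..64 small-state entries before upgrading

def h64(value):
    # 64-bit hash of the value; a real implementation would use a fast
    # non-cryptographic hash, blake2b is just a convenient stand-in
    digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

class CombinedUniq:
    def __init__(self):
        self.small = []    # linear array of hashes; cheap for tiny sets
        self.large = None   # stand-in for a HyperLogLog after the upgrade

    def add(self, value):
        hv = h64(value)
        if self.large is not None:
            self.large.add(hv)
            return
        if hv not in self.small:
            self.small.append(hv)
            if len(self.small) > SMALL_LIMIT:
                # upgrade: move the accumulated hashes into the large state
                self.large = set(self.small)
                self.small = []

    def result(self):
        return len(self.large) if self.large is not None else len(self.small)
```

In the GROUP BY scenario above, the vast majority of URLs never leave the small linear-array state, so the per-group cost stays at a few dozen bytes; only the rare hot URLs pay for the larger structure.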