What does HackerNews think of ClickHouse?

ClickHouse® is a free analytics DBMS for big data

Language: C++

#48 in Hacktoberfest
#12 in SQL
We use -Wall -Werror -Wextra -Weverything, plus clang-tidy with -Werror, plus ASan, TSan, MSan, and UBSan, plus fuzzing (5+ different types of it). We update to the latest compiler on every release (today it is clang-17), we build every dependency from source, and we always use hermetic builds (cross-compiling).

I think it is the only sane way for C++.

https://github.com/ClickHouse/ClickHouse/

As one of the contributors, I'm always happy to see interest and people using it.

Keeper is a really interesting challenge, and we're open to any kind of feedback and thoughts.

If you tried it out and have some feedback for it, I encourage you to create an issue (https://github.com/ClickHouse/ClickHouse), ask on our Slack, ping me directly on Slack... (just don't call me on my phone)

And don't forget that it's completely open-source like ClickHouse so contributors are more than welcome.

We have implemented asynchronous signal-safe in-process stack unwinding for an always-on profiler in ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-perform...

The downside: it required many patches to LLVM's libunwind, and not all of them have been accepted yet: https://bugs.llvm.org/show_bug.cgi?id=48186

ClickHouse source code: https://github.com/ClickHouse/ClickHouse

The order should be as follows:

- automated testing;

- -Wall -Wextra, and -Weverything with some exceptions;

- address, memory, thread, and undefined sanitizers;

- fuzzing (at least some sort of it), ideally coverage-guided fuzzing with sanitizers;

- static analysis, starting with clang-tidy;

- other static analyzers, such as Coverity and CodeQL, have the lowest priority.

An example of how everything of the above is applied, everywhere, all at once, in public can be seen in the per-commit checks of the ClickHouse repository: https://github.com/ClickHouse/ClickHouse. But keep in mind that the checks are really heavy: they run for multiple hours, and we have to apply kludges to support GitHub Actions with spot instances...

Also the dashboard can be checked here: https://aretestsgreenyet.com/

You don't have to give up clean code to achieve high performance. The complexity can be isolated and contained.

For example, take a look at ClickHouse codebase: https://github.com/ClickHouse/ClickHouse/

There are all sorts of things: leaky abstractions, specializations for optimistic fast paths, dispatching between algorithms based on data distribution, runtime CPU dispatching, etc. Video: https://www.youtube.com/watch?v=ZOZQCQEtrz8

But: it has clean interfaces, virtual calls, factories... And, most importantly, a lot of code comments. And when you have a horrible piece of complexity, it can be isolated into a single file and will annoy you only when you need to edit that part of the project.

Disclaimer: I'm promoting ClickHouse because it deserves it.

It is interesting to compare this with observations from the practice of using ClickHouse[1][2] for time series:

1. Reordering to a structure-of-arrays (SoA) layout helps a lot - this is the whole point of column-oriented databases.

2. Specialized codecs like Gorilla[3], DoubleDelta[4], and FPC[5] lose to simply using ZSTD[6] compression in most cases, both in compression ratio and in performance (a short sketch follows the references).

3. Specialized time-series DBMS like InfluxDB or TimescaleDB lose to general-purpose relational OLAP DBMS like ClickHouse [7][8][9].

[1] https://clickhouse.com/blog/optimize-clickhouse-codecs-compr...

[2] https://github.com/ClickHouse/ClickHouse

[3] https://clickhouse.com/docs/en/sql-reference/statements/crea...

[4] https://clickhouse.com/docs/en/sql-reference/statements/crea...

[5] https://clickhouse.com/docs/en/sql-reference/statements/crea...

[6] https://github.com/facebook/zstd/

[7] https://arxiv.org/pdf/2204.09795.pdf "SciTS: A Benchmark for Time-Series Databases in Scientific Experiments and Industrial Internet of Things" (2022)

[8] https://gitlab.com/gitlab-org/incubation-engineering/apm/apm... https://gitlab.com/gitlab-org/incubation-engineering/apm/apm...

[9] https://www.sciencedirect.com/science/article/pii/S187705091...
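
To make point 2 concrete, here is a sketch of how per-column codecs are declared in ClickHouse DDL. The table and column names are made up for illustration, and, per the observation above, plain ZSTD is often the variant that wins anyway:

    CREATE TABLE sensor_data
    (
        time  DateTime CODEC(DoubleDelta, ZSTD),  -- delta-of-delta, then general-purpose compression
        value Float64  CODEC(Gorilla, ZSTD),      -- XOR-based float codec, then ZSTD
        raw   Float64  CODEC(ZSTD(3))             -- frequently the better choice overall
    )
    ENGINE = MergeTree
    ORDER BY time;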

We achieved a more than 3x speedup using "on-demand" transparent huge pages in ClickHouse[1] for a very narrow use case: random access to a hash table that does not fit in the L3 cache but is not much larger than it.

But there was a surprise... a more than 10x degradation of overall Linux server performance due to increased physical memory fragmentation after a few days in production: https://github.com/ClickHouse/ClickHouse/commit/60054d177c8b...

That was seven years ago, and I hope the Linux kernel has improved since. I will need to try a "revert of the revert" of this commit. These changes cannot be tested with microbenchmarks; only production usage can show their actual impact.

Also, we successfully use huge pages for the text section of the executable; it is beneficial for the stability of performance benchmarks because it lowers the number of iTLB misses.

[1] ClickHouse - high-performance OLAP DBMS: https://github.com/ClickHouse/ClickHouse/

There were countless attempts to extend or replace SQL:

OQL: https://en.wikipedia.org/wiki/Object_Query_Language

UnQL: https://www.dataversity.net/unql-a-standardized-query-langua...

More modern:

PRQL: https://prql-lang.org/

Malloy: https://news.ycombinator.com/item?id=30053860 (it is so obscure that Google replaces it with "Malay language")

Another example: ClickHouse supports standard SQL with many features such as window functions and SQL/JSON, and extends it to make it convenient for data analysts by adding:

- higher-order functions;

- nested data structures: arrays, tuples, and maps;

- aggregate function states as first-class citizens;

- unrestricted use of aliases in any place of an expression;

and so on.

https://github.com/ClickHouse/ClickHouse/
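
A small sketch of a few of these extensions; the queries below use only the built-in numbers() table function, so nothing here depends on a real dataset:

    -- higher-order functions over arrays, with lambdas
    SELECT arrayMap(x -> x * 2, [1, 2, 3]) AS doubled,
           arrayFilter(x -> x % 2 = 1, [1, 2, 3, 4]) AS odds;

    -- aliases can be reused anywhere in the same expression list
    SELECT number * 2 AS n2, n2 + 1 AS n3
    FROM numbers(3);

    -- aggregate function states as first-class values
    SELECT finalizeAggregation(uniqState(number)) AS distinct_count
    FROM numbers(10);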

I'm an everyday user of ClickHouse, and I find its SQL implementation the most pleasant to use! Although that's unsurprising, because I'm also one of its authors... I also welcome innovation and improvement of SQL without the introduction of a completely different language.

Simply by using larger compressed blocks, which are limited to the page size in Postgres, and by improving data locality through sorting, which is inherent to LSM-trees.

But if you want higher compression, you need to consider a column-oriented DBMS, such as ClickHouse[1]. They are unbeatable in terms of data compression.

[1] https://github.com/ClickHouse/ClickHouse
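
One way to see this in practice is to compare compressed and uncompressed sizes per column. The sketch below assumes a hypothetical table named 'hits' in the current database:

    SELECT
        name,
        formatReadableSize(data_compressed_bytes)   AS compressed,
        formatReadableSize(data_uncompressed_bytes) AS uncompressed,
        round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
    FROM system.columns
    WHERE database = currentDatabase() AND table = 'hits'
    ORDER BY data_compressed_bytes DESC;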

Disclaimer: I'm a developer of ClickHouse.

I also found it strange that PostgreSQL cannot simply support the data formats of previous versions. In contrast, the latest version of ClickHouse[1] (23.1) can be installed over a version from 2016, and it does not require format conversions or any other migration procedures.

[1] https://github.com/ClickHouse/ClickHouse/

Syscalls can be heavier than expected. One example is when an application is run inside gVisor. Another example is when a lot of eBPF code is attached. A third example is when a program is run under strace.

Disclaimer: I'm working on ClickHouse[1], and it is used by thousands of companies in unimaginable environments. It has to work in every possible condition... That's why we set the TZ variable at startup and also embed the timezones into the binary. And we don't use the glibc functions for timezone operations because they are astonishingly slow.

https://github.com/ClickHouse/ClickHouse/

A memory-bounded alternative to the COUNT(DISTINCT ...) aggregate function in SQL.

ClickHouse[1] has a few functions for that, with different tradeoffs in memory usage, precision, and performance:

    - uniqExact - the same as COUNT(DISTINCT ...) - the exact version, using a hash table;
    - uniqCombined - a combination of a small linear array in a memory arena, a hash table, and a HyperLogLog - this data structure is similar to HyperLogLog++; the log2 of the number of cells is controllable by the user;
    - uniq - a cardinality estimator based on adaptive sampling by hash; it has a worse quality-to-memory-usage ratio than HyperLogLog but beats it in CPU efficiency;
    - uniqTheta - uses the Theta sketch algorithm, whatever it means;
    - uniqUpTo(N) - my favorite, uses a linear array to give the precise answer up to N, and always answers N + 1 when the number of distinct elements is larger than N;
    - groupBitmap - an exact calculation using Roaring Bitmaps, makes sense for dense integer sets;
[1] https://github.com/ClickHouse/ClickHouse/
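
A short sketch of how a few of these are used in a query; the table and column names are hypothetical:

    SELECT
        uniqExact(UserID)        AS exact,      -- exact, hash table
        uniq(UserID)             AS estimate,   -- adaptive sampling by hash
        uniqCombined(12)(UserID) AS combined,   -- log2 of the number of cells = 12
        uniqUpTo(10)(UserID)     AS up_to_10    -- exact up to 10, otherwise returns 11
    FROM pageviews;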

What is often forgotten when designing a data structure for a cardinality estimator is that it should work well not only for a few large sets but also for a large number of small sets.

For example, in a query like the following:

    SELECT URL, COUNT(DISTINCT UserID) FROM pageviews GROUP BY URL
You should expect a large number of URLs, but most of them will get just 1..2 unique UserIDs. Obviously, it's not practical to allocate even a 4 KB HyperLogLog for every resulting record. That's what justifies complex combined data structures: they should start with ~16..64 bytes of automatic memory in an arena or on the stack, and only then upgrade to a HyperLogLog.
+ totally agree with that.

There is a use case where you can improve performance by keeping compressed (LZ4) data in RAM and decompressing it in small blocks that fit in the cache. This is demonstrated by ClickHouse[1][2]: the whole data processing after decompression fits in the cache, and compression saves RAM bandwidth.

[1] https://presentations.clickhouse.com/meetup53/optimizations/

[2] https://github.com/ClickHouse/ClickHouse

The most developed analytical query engine as of today is ClickHouse [1].

It is open-source, has separation of storage and compute, and offers the best performance.

ClickHouse is a source of inspiration for many emerging database engines: Velox, Doris, DuckDB, Datafusion...

[1] https://github.com/ClickHouse/ClickHouse

Disclaimer: I work on ClickHouse.

ClickHouse updates materialized views in real time.

How it is done:

1. Intermediate states of aggregate functions are first-class data types in ClickHouse. You can create a data type for the intermediate state of count(distinct), quantile, or whatever, and it will store a serialized state. It can be inserted into a table, read back, merged or finalized, and processed further.

2. Materialized views are simply triggers on INSERT. This enables incremental aggregation in real time with materialized views.
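
A minimal sketch of both points together, assuming a hypothetical source table pageviews(ts DateTime, user_id UInt64); all names are made up for illustration:

    -- 1. an intermediate aggregate state stored as a column type
    CREATE TABLE daily_uniques
    (
        day     Date,
        uniques AggregateFunction(uniq, UInt64)
    )
    ENGINE = AggregatingMergeTree
    ORDER BY day;

    -- 2. a materialized view as a trigger on INSERT into the source table
    CREATE MATERIALIZED VIEW daily_uniques_mv TO daily_uniques AS
        SELECT toDate(ts) AS day, uniqState(user_id) AS uniques
        FROM pageviews
        GROUP BY day;

    -- finalize the states at query time
    SELECT day, uniqMerge(uniques) AS unique_users
    FROM daily_uniques
    GROUP BY day;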

Downsides:

A limited area of application: it does not respect multi-table queries if updates are done on more than one table.

Disclaimer: I'm a developer of ClickHouse: https://github.com/ClickHouse/ClickHouse

> Is this fully hosted, or does it run in my cloud? Is this running https://github.com/ClickHouse/ClickHouse or a closed-source ClickHouse Cloud variant with added features?

I'm not sure what you mean. ClickHouse is an open-source project. You can check it out here: https://github.com/ClickHouse/ClickHouse
ClickHouse for OLAP and YDB for OLTP are both written in C++. Both are used at production scale at Yandex. ClickHouse has been around for a long time, while YDB is pretty new.

ClickHouse - https://github.com/ClickHouse/ClickHouse

YDB - https://github.com/ydb-platform/ydb

> After the JetBrains hack

AFAIK there was no evidence that the SolarWinds fiasco was because of JetBrains [0]. Also, ClickHouse is open source [1] under the Apache 2.0 license, so you can check the source and question it yourself.

[0] - https://www.zdnet.com/article/jetbrains-denies-being-involve...

[1] - https://github.com/ClickHouse/ClickHouse

ClickHouse has a very clean and modern C++ codebase. https://github.com/ClickHouse/ClickHouse