What does HackerNews think of tsbs?

Time Series Benchmark Suite, a tool for comparing and evaluating databases for time series data

Language: Go

[One edit, adding one additional paragraph at the end]

Note that I'm one of the co-founders of QuestDB, but let me try to be as objective and unbiased as possible. Under the hood, InfluxDB and QuestDB are built differently. Both storage engines are column-oriented. InfluxDB's storage engine uses a Time-Structured Merge Tree (TSM), while QuestDB uses a linear data structure (arrays). A linear data structure makes it easier to leverage modern hardware with native support for the CPU's SIMD instructions [1]. Running close to the hardware is one of the key differentiators of QuestDB from an architectural standpoint.
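To make this concrete, here is a toy Go sketch (illustrative only, not QuestDB code) of why a flat columnar array is friendly to modern CPUs: the values sit contiguously in memory, so a tight loop over them can be prefetched and auto-vectorised, whereas a tree-structured store has to chase pointers between nodes.

    package main

    import "fmt"

    // sumColumn scans one column stored as a contiguous slice. Sequential
    // access over a flat array is the pattern SIMD-friendly engines rely on.
    func sumColumn(prices []float64) float64 {
        var total float64
        for _, p := range prices {
            total += p
        }
        return total
    }

    func main() {
        col := []float64{101.2, 101.4, 101.1, 101.9}
        fmt.Println(sumColumn(col))
    }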

Both have a Write-Ahead Log (WAL) that makes the data durable in case of an unexpected failure. Both use the InfluxDB Line Protocol (ILP) to ingest data efficiently. Hats off to InfluxDB's team; we found the ILP implementation very neat. However, QuestDB's implementation of ILP runs over TCP rather than HTTP, for performance reasons. QuestDB is also Postgres wire compatible, meaning that you could ingest via Postgres as well, although for market data that would not be the recommended way.
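To give an idea of how lightweight ILP ingestion is, here is a minimal Go sketch (not an official client library): it writes one ILP line over a raw TCP socket. Port 9009 is QuestDB's default ILP listener, and the table/column names ("trades", "symbol", "price") are made up for the example.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // Connect to QuestDB's ILP listener (default TCP port 9009).
        conn, err := net.Dial("tcp", "localhost:9009")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // ILP row format: table,tag_set field_set timestamp-in-nanoseconds
        line := fmt.Sprintf("trades,symbol=BTC-USD price=23453.5 %d\n", time.Now().UnixNano())
        if _, err := conn.Write([]byte(line)); err != nil {
            panic(err)
        }
    }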

One characteristic of QuestDB is that data is always ordered by time on disk, and out-of-order data is dealt with before touching the disk [2]. The data is partitioned by time. For queries spanning time intervals, the relevant time partitions and columns are lifted into memory, while the others are left untouched. This makes such queries (downsampling, interval search, etc.) particularly fast and efficient.

From a developer experience standpoint, one material difference is the language: InfluxDB has its own native language, Flux [3], while QuestDB uses SQL, with a handful of native SQL extensions to manipulate time-series data efficiently: SAMPLE BY, LATEST ON, etc. [4]. QuestDB also includes SQL joins and a time-series join (ASOF JOIN) that is popular for market data. Since QuestDB speaks the PostgreSQL wire protocol, developers can use their standard Postgres libraries to query from any language.
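For instance, a downsampling query using SAMPLE BY can be issued from Go through the standard database/sql package and the lib/pq driver. This is a hedged sketch: the connection settings match a default QuestDB install (Postgres wire on port 8812, user admin), and the "trades" table is an assumption for illustration.

    package main

    import (
        "database/sql"
        "fmt"
        "time"

        _ "github.com/lib/pq" // standard Postgres driver, nothing QuestDB-specific
    )

    func main() {
        db, err := sql.Open("postgres",
            "host=localhost port=8812 user=admin password=quest dbname=qdb sslmode=disable")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        // Downsample raw ticks into 1-minute averages with QuestDB's SAMPLE BY extension.
        rows, err := db.Query("SELECT timestamp, avg(price) FROM trades SAMPLE BY 1m")
        if err != nil {
            panic(err)
        }
        defer rows.Close()

        for rows.Next() {
            var ts time.Time
            var avgPrice float64
            if err := rows.Scan(&ts, &avgPrice); err != nil {
                panic(err)
            }
            fmt.Println(ts, avgPrice)
        }
    }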

From a performance perspective, InfluxDB is known to struggle with ingestion and queries on high-cardinality datasets [5]. QuestDB deals with such high-cardinality datasets better and is particularly good at ingesting data from concurrent sources, with a max throughput that can now reach nearly 5M rows/sec on a single machine. Benchmarks on TSBS [6] with the latest version will follow soon.

InfluxDB is a platform, meaning that they provide an exhaustive offering around the database, while QuestDB is less mature. QuestDB is not yet fully compatible with several tools (a dashboard like Metabase, for example), as some popular ones have been prioritised instead (Grafana, Kafka, Telegraf, Pandas dataframes). The charting capabilities of InfluxDB's console are excellent, while QuestDB users would mostly rely on Grafana instead.

[Adding this via post edit #1] One area where Influx currently has an edge is storage overhead. QuestDB does not support compression yet, and time-series data can often be compressed well [7]. Chances are QuestDB will use more disk space to store the same amount of data.

Hope this helps!

[1] https://news.ycombinator.com/item?id=22803504

[2] https://questdb.io/blog/2021/05/10/questdb-release-6-0-tsbs-...

[3] https://docs.influxdata.com/influxdb/cloud/query-data/get-st...

[4] https://questdb.io/blog/2022/11/23/sql-extensions-time-serie...

[5] https://docs.influxdata.com/influxdb/cloud/write-data/best-p...

[6] https://github.com/timescale/tsbs

[7] https://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Yes, correct - although ClickHouse is more of an OLAP database. Timescale is built on top of Postgres, while QuestDB is built from scratch with Postgres wire compatibility. You can run benchmarks on https://github.com/timescale/tsbs

Last year we released QuestDB 6.0 and achieved an ingestion rate of 1.4 million rows per second (per server). We compared those results to popular open source databases [1] and explained how we dealt with out-of-order ingestion under the hood while keeping the underlying storage model read-friendly.

Since then, we focused our efforts on making queries faster, in particular filter queries with WHERE clauses. To do so, we once again decided to build things from scratch, writing a JIT (Just-in-Time) compiler for SQL filters with plenty of low-level optimisations such as SIMD. We then parallelized the query execution to improve the execution time even further.

In this blog post, we first look at some benchmarks against ClickHouse and TimescaleDB, before digging deeper into how this all works within QuestDB's storage model. Once again, we use the Time Series Benchmark Suite (TSBS) [2], developed by TimescaleDB: it is an open-source, reproducible benchmark.

We'd love to get your feedback!

[1]: https://news.ycombinator.com/item?id=27411307

[2]: https://github.com/timescale/tsbs

To replicate, please see the Time Series Benchmark Suite, which is open-source and has many vendor-contributed configurations:

- https://github.com/timescale/tsbs

- https://github.com/timescale/tsbs/blob/master/docs/timestrea...

(Timescale co-founder)

I'll answer this here with a similar response that I gave Pradeep (the author) via Twitter.

I think ClickHouse is a great technology. It totally beats TimescaleDB for OLAP queries. I'll be the first to admit that.

What our benchmark (a 100+ hour, 3-month analysis) showed is that for _time-series workloads_, TimescaleDB fared better. [0]

Pradeep's analysis - while earnest - is essentially comparing OLAP-style queries using a dataset that is not very representative of time-series workloads. That is why the Time Series Benchmark Suite (TSBS) [1] exists (which we did not create, although we now maintain it). I've asked Pradeep to compare using the TSBS - and he said he'd look into it. [2]

As a developer, I'm very wary of technologies that claim to be better at everything - especially those that hide their weaknesses. We don't do that at TimescaleDB. For those who read our benchmark closely, we clearly show where ClickHouse beats TimescaleDB, and where TimescaleDB does better. And - despite what many commenters on here may want you to think - we heap loads of praise on ClickHouse.

As a reader of HackerNews, I'm also tired of all the negativity that's developing on this site. People who bully. People who default to accusing others of dishonesty instead of trying to have a meaningful dialogue and reach mutual understanding. People who enter debates wanting to be right, versus wanting to identify the right answer. Disappointingly, this includes some visible influencers whom I personally know. We should all strive to do better, to assume positive intent, and have productive dialogues.

(This is why one of our values at TimescaleDB is "Assume Positive Intent." [3] I think Hacker News - and the world in general - would be a much better, happier, healthier place if we all just did that.)

[0] https://blog.timescale.com/blog/what-is-clickhouse-how-does-...

[1] https://github.com/timescale/tsbs

[2] https://twitter.com/p_chhetri/status/1455216425807745025

[3] https://www.timescale.com/careers

I see a lot of really divergent results with these time series database benchmarking posts. Timescale's open source benchmark suite [0] is a great contribution towards making different software comparable, but it seems like the tasks/metrics heavily favor TimescaleDB.

This article has ClickHouse more or less spanking TimescaleDB, but the blog post it references [1] is basically the reverse. Are the use cases just that different?

-----

0. https://github.com/timescale/tsbs

1. https://blog.timescale.com/blog/what-is-clickhouse-how-does-...

If you are referring to this post: https://altinity.com/blog/clickhouse-for-time-series

That post was written in November 2018 - 3 years ago - when TimescaleDB was barely 1.0.

A lot has changed since then:

1. TimescaleDB launched native columnar compression in 2019, which completely changed its story around storage footprint and query performance [0]

2. TimescaleDB has gotten much better

3. PostgreSQL has also gotten better (which in turn makes TimescaleDB better)

In fact, IIRC Altinity used the TSBS and contributed ClickHouse support to it [1], which is also what this newer benchmark uses.

(Disclaimer: TimescaleDB co-founder)

[0] https://blog.timescale.com/blog/building-columnar-compressio...

[1] https://github.com/timescale/tsbs

(Timescale team member here)

We used the Time Series Benchmark Suite for all these tests: https://github.com/timescale/tsbs. Also, Ryan (post author) will be giving all the config details in a Twitch stream happening next Wednesday. We'll be uploading the video to YouTube immediately afterwards too >>

twitch.tv/timescaledb youtube.com/timescaledb

[Timescale DevRel here]

@zX41ZdbW - Thanks for pointing out the various benchmarks that have been run by other companies between ClickHouse and TimescaleDB using TSBS [1]. As we mentioned, we'll dig deeper into a similar benchmark with much more detail than any of those examples in an upcoming blog post.

One notable omission in all of the benchmarks that we've seen is that none of them enable TimescaleDB compression (which also transforms row-oriented data into a columnar-type format). In our detailed benchmarking, queries on compressed columnar data in TimescaleDB outperformed ClickHouse in most cases, particularly as cardinality increases, often by 5x or more. And with compression of 90% or more, storage is often comparable. (Again, blog post coming soon - we are just making sure our results are accurate before rushing to publish.)
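For reference, turning compression on for a hypertable is only a couple of statements. A hedged sketch driven from Go via database/sql (the hypertable name "metrics", the segment-by column "tags_id", and the connection string are assumptions, not the benchmark's actual configuration; add_compression_policy is the TimescaleDB 2.x API):

    package main

    import (
        "database/sql"

        _ "github.com/lib/pq"
    )

    func main() {
        db, err := sql.Open("postgres",
            "host=localhost port=5432 user=postgres dbname=benchmark sslmode=disable")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        stmts := []string{
            // Enable native compression and choose how rows are grouped into columnar segments.
            "ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby = 'tags_id')",
            // Automatically compress chunks once they are older than seven days.
            "SELECT add_compression_policy('metrics', INTERVAL '7 days')",
        }
        for _, s := range stmts {
            if _, err := db.Exec(s); err != nil {
                panic(err)
            }
        }
    }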

The beauty of TimescaleDB's columnar compression model is that it allows the user to decide when their workload can benefit from deep/narrow queries over data that doesn't change often (although it can still be modified just like regular row data), versus shallow/wide queries for things like inserting data and near-time queries.

It's a hybrid model that provides a lot of flexibility for users AND significantly improves the performance of historical queries. So yes, we do agree that columnar storage is a huge performance win for many types of queries.

And of course, with TimescaleDB, one also gets all of the benefits of PostgreSQL and its vibrant ecosystem.

Can't wait to share the details in the coming weeks!

[1]: https://github.com/timescale/tsbs

We launched QuestDB last summer [1, 2]. Our storage model is vector-based and append-only. This meant that all incoming data had to arrive in the correct time order. This worked well for some use cases but we increasingly saw real-world cases where data doesn't always land at the database in chronological order. We saw plenty of developers and users come and go specifically because of this technical limitation. So it became a priority to deal with out-of-order data.

The big decision was which direction to take to tackle the problem. LSM trees seemed an obvious choice, but we chose an alternative route so we wouldn't lose the performance we spent years building. Our latest release supports out-of-order ingestion by re-ordering data on the fly. That's what this article is about.
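As a toy illustration only (this is not QuestDB's actual implementation), "re-ordering data on the fly" means an out-of-order batch is sorted by timestamp before it is merged into the time-ordered store; something along these lines:

    package main

    import (
        "fmt"
        "sort"
    )

    // Row is one incoming measurement.
    type Row struct {
        TS    int64 // epoch nanoseconds
        Value float64
    }

    // commitBatch sorts a small out-of-order batch by timestamp so it can be
    // appended to time-ordered storage. A real engine does far more (merging
    // into existing partitions, handling overlaps), but the core idea is the same.
    func commitBatch(batch []Row) []Row {
        sort.Slice(batch, func(i, j int) bool { return batch[i].TS < batch[j].TS })
        return batch
    }

    func main() {
        batch := []Row{{TS: 30, Value: 1.0}, {TS: 10, Value: 2.0}, {TS: 20, Value: 3.0}}
        fmt.Println(commitBatch(batch))
    }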

Also, we had many people asking about the differences between QuestDB and other open-source databases and why users should consider giving it a try instead of other systems. When we launched on HN, readers showed a lot of interest in side-by-side comparisons to other databases on the market. One suggestion [3] that we thought would be great to try out was to benchmark ingestion and query speeds using the Time Series Benchmark Suite (TSBS) [4] developed by TimescaleDB. We're super excited to share the results in the article.

[1] https://news.ycombinator.com/item?id=23975807

[2] https://news.ycombinator.com/item?id=23616878

[3] https://news.ycombinator.com/item?id=23977183

[4] https://github.com/timescale/tsbs

Disclosure: I work at AWS but not on Timestream. Opinions my own.

Unless I'm missing something, this is not an apples-to-apples benchmark. TimescaleDB is running as a single node without any replication, whereas Amazon Timestream is replicated [0] to three AWS Availability Zones for durability. I've only skimmed the TSBS [1] repo and the start/stop scripts for TimescaleDB. Can someone confirm this?

0 - https://aws.amazon.com/blogs/aws/store-and-access-time-serie...

1 - https://github.com/timescale/tsbs

Pre-reading hypothesis: TimescaleDB is declared orders of magnitude faster because the benchmark is serving results they're computing at write time? Is it just like the ClickHouse benchmark from earlier, where they read from a `CREATE TABLE [...] ENGINE = AggregatingMergeTree`?

Post-reading:

"faster queries via continuous aggregates". So is this it? I couldn't find how tables / materialized views were created in the source though [1].

TimescaleDB is probably a very good product (and pg-compatible!), but producing such articles while hiding the usage of a magic feature is sort of dishonest. Why not write an article directly about the power of the feature? It's hurting their brand reputation a bit.

[1] https://github.com/timescale/tsbs

This is on the roadmap; we will work on integrating with https://github.com/timescale/tsbs. TSBS has VictoriaMetrics support too.

Perhaps the QuestDB team could add it to the Time Series Benchmark Suite [1]? It currently supports benchmarking 9 databases, including TimescaleDB and InfluxDB.

[1] https://github.com/timescale/tsbs

You can look at what we use in our benchmarking tool https://github.com/timescale/tsbs (results described here https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...).

Pretty much it's a table with time, value, and tags_id, where the tags table is (id, jsonb).
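A hedged sketch of that layout as DDL issued from Go (the names and connection string are illustrative; the authoritative schema lives in the TSBS loader code):

    package main

    import (
        "database/sql"

        _ "github.com/lib/pq"
    )

    func main() {
        db, err := sql.Open("postgres",
            "host=localhost port=5432 user=postgres dbname=benchmark sslmode=disable")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        stmts := []string{
            // One JSONB row per distinct tag set.
            "CREATE TABLE tags (id SERIAL PRIMARY KEY, tagset JSONB)",
            // Narrow readings table pointing back at its tag set.
            "CREATE TABLE metrics (time TIMESTAMPTZ NOT NULL, value DOUBLE PRECISION, tags_id INTEGER REFERENCES tags(id))",
            // Make the readings table a TimescaleDB hypertable partitioned on time.
            "SELECT create_hypertable('metrics', 'time')",
        }
        for _, s := range stmts {
            if _, err := db.Exec(s); err != nil {
                panic(err)
            }
        }
    }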

That's a good question! Especially considering these overwhelming benchmarks [1] run with Timescale's TSBS [2].

[1] https://www.altinity.com/blog/clickhouse-for-time-series

[2] https://github.com/timescale/tsbs

(TimescaleDB co-founder)

TimescaleDB is more performant than you may think. We've benchmarked this extensively: e.g., outperforming InfluxDB [1] [2], Cassandra [3], and Mongo [4].

We've also open-sourced the benchmarking suite so others can run these themselves and verify our results. [5]

We also beat MemSQL regularly for enterprise engagements (unfortunately can't share those results publicly).

I think the scalability of ClickHouse is quite compelling, and if you need more than 1-2M inserts a second and 100 TB of storage, that would be one case where I'd recommend another database over our own. But horizontal scalability is something we have been working on for nearly a year, so we expect this to be less of an issue in the near future (we will have more to share later this month).

You are correct, however, that TimescaleDB requires more storage than some of these other options. If storage is the most important criterion for you (i.e., more important than usability or performance), then again I would point you to one of the other databases that are more optimized for compression. However, you can get 6-8x compression by running TimescaleDB on ZFS today, and we are also currently working on additional techniques for achieving higher compression rates.

[1] https://blog.timescale.com/timescaledb-vs-influxdb-for-time-...

[2] https://blog.timescale.com/what-is-high-cardinality-how-do-t...

[3] https://blog.timescale.com/time-series-data-cassandra-vs-tim...

[4] https://blog.timescale.com/how-to-store-time-series-data-mon...

[5] https://github.com/timescale/tsbs

Would something like the TSBS [1] help with this? It's from TimescaleDB, but they're built on Postgres. It has built-in high-CPU queries, but I haven't seen high-memory before. Can you point me in the right direction? Otherwise, we've had some Postgres people use the lab and are waiting on their decision on whether to share publicly.

[1]: https://github.com/timescale/tsbs

Hey rw, one of the core contributors to TSBS here. First of all, thank you for the work you did on influxdb-comparisons, it gave us a lot to work with and helped us understand Timescale’s strengths and weaknesses against other systems early on. We do appreciate the diligence and transparency that went into the project. We outline some of the reasons for our eventual decision to fork the project in our recent release post [1]. Most of the reasons boil down to needing more flexibility in the data models/use cases we benchmark and needing a more maintainable code design since we’re using this widely for a lot of internal testing.

Verification of the correctness of the query results is obviously something we take very seriously; otherwise, running these benchmarks would be pretty pointless. We carefully verified the correctness of all of the query benchmarks we published. However, it's a process we haven't fully automated yet. From what we can tell, the same is true of influxdb-comparisons - the validation pretty-prints full responses, but each database has a significantly different format, so one needs to manually parse the results or set up a separate tool to do so. We have our own methods for doing that internally - once we get the process more standardized and automated, we will definitely be adding it to TSBS. We encourage anyone with ideas around that (or anything else) to take a look at the open source TSBS code and consider contributing [2].

[1] https://blog.timescale.com/time-series-database-benchmarks-t...

[2] https://github.com/timescale/tsbs