What does HackerNews think of octosql?
OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
Disclaimer: author of OctoSQL
[0]: https://github.com/cube2222/octosql
[1]: https://duckdb.org/
It's great to see more tools in this area (querying data from various sources in-place) and the Lambda use case is a really cool idea!
I've recently done a bunch of benchmarking including ClickHouse Local, and the usage was straightforward, with everything working as it's supposed to.
One comment on performance, though: an area where I think ClickHouse could still improve, at least compared to OctoSQL[0], is the JSON datasource, which seems slower, especially when only a small part of each JSON object is used. If only one field out of many is used, OctoSQL lazily parses just that field and skips the others, which yields non-trivial performance gains on big JSON files with small queries.
Basically, for a query like `SELECT COUNT(*), AVG(overall) FROM books.json` with the Amazon Review Dataset (10GB), OctoSQL is twice as fast (3s vs 6s). That's a minor thing though (OctoSQL will slow down for more complicated queries, while for ClickHouse decoding the input is and remains the bottleneck, with the processing itself being ridiculously fast).
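To make the lazy-parsing idea concrete, here is a minimal Python sketch. It is not OctoSQL's actual implementation (OctoSQL is written in Go); the helper names `extract_field` and `count_and_avg` and the sample records are made up for illustration. It assumes newline-delimited JSON objects, well-formed input, and key names without escape sequences. The scanner walks the top-level keys, skips unwanted values by matching quotes and brackets, and only decodes the one value the query needs.

```python
import json

WS = " \t\r\n"

def _skip_ws(s, i):
    while i < len(s) and s[i] in WS:
        i += 1
    return i

def _skip_string(s, i):
    # s[i] is the opening quote; return the index just past the closing quote.
    i += 1
    while s[i] != '"':
        i += 2 if s[i] == "\\" else 1  # step over escaped characters
    return i + 1

def _skip_value(s, i):
    # Find the end of a JSON value without building any Python objects.
    if s[i] == '"':
        return _skip_string(s, i)
    if s[i] in "{[":
        depth = 0
        while True:
            c = s[i]
            if c == '"':
                i = _skip_string(s, i)  # brackets inside strings don't count
                continue
            if c in "{[":
                depth += 1
            elif c in "}]":
                depth -= 1
                if depth == 0:
                    return i + 1
            i += 1
    # Scalar (number, true, false, null): runs until a delimiter.
    while i < len(s) and s[i] not in ",}]" + WS:
        i += 1
    return i

def extract_field(line, field):
    """Decode only the top-level `field` of a JSON object; None if absent."""
    i = _skip_ws(line, 0)
    if i >= len(line) or line[i] != "{":
        raise ValueError("expected a JSON object")
    i = _skip_ws(line, i + 1)
    while line[i] != "}":
        key_end = _skip_string(line, i)
        key = line[i + 1 : key_end - 1]       # raw key, assumed escape-free
        i = _skip_ws(line, key_end)           # now at ':'
        i = _skip_ws(line, i + 1)             # now at the start of the value
        value_end = _skip_value(line, i)
        if key == field:
            return json.loads(line[i:value_end])  # decode just this value
        i = _skip_ws(line, value_end)
        if line[i] == ",":
            i = _skip_ws(line, i + 1)
    return None

def count_and_avg(lines, field="overall"):
    """Rough equivalent of SELECT COUNT(*), AVG(field); AVG ignores missing values."""
    count, non_null, total = 0, 0, 0.0
    for line in lines:
        if not line.strip():
            continue
        count += 1
        value = extract_field(line, field)
        if value is not None:
            non_null += 1
            total += value
    return count, (total / non_null if non_null else None)

# Hypothetical sample records, loosely shaped like review data.
SAMPLE = [
    '{"reviewerID": "A1", "overall": 5.0, "reviewText": "great, {really}", "helpful": [1, 2]}',
    '{"overall": 3.0, "style": {"Format": "Kindle"}}',
    '{"reviewText": "no rating"}',
]

if __name__ == "__main__":
    print(count_and_avg(SAMPLE))
```

In pure Python this scanner will not necessarily beat the C-accelerated `json.loads`; the point is only to show where the savings come from in a compiled engine: values that the query never touches are never decoded or allocated.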
Godspeed with the future evolution of ClickHouse!
I've been following Materialize, as their blog posts are a great source of inspiration when working on OctoSQL[0] (a CLI SQL dataflow engine). I was a bit surprised by how few data sources they supported (basically Kafka and Postgres, based on their docs), but now that they're pivoting to being a database themselves, this makes much more sense.
I also think the architecture is really cool. Cloud-native is the way to go for modern databases and will make adoption much easier than something you'd have to host on bare metal. One question, though: does this mean the open-source version is basically deprecated now and further development is closed-source, or does the open-source project represent the "compute" part of the "next gen Materialize"?
Congrats and good luck with further development!
That said, I don't imagine this ever being a bottleneck for the main use case of Steampipe - in that case I think the APIs themselves will always be the limiting part. But it does - potentially - speak to what you can expect if you'd like to extend your usage of Steampipe to more than just DevOps data.
I've used the benchmark available in the OctoSQL README.
[0]: https://github.com/cube2222/octosql
[1]: https://github.com/apache/arrow-datafusion
Disclaimer: author of OctoSQL