What does HackerNews think of octosql?

OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.

Language: Go

#84 in Go
#76 in Go
#23 in JSON
#21 in MySQL
#4 in MongoDB
#50 in PostgreSQL
#3 in GraphQL
#44 in SQL
Just thought I'd plug OctoSQL[1], which I've enjoyed using for this. It parses CSV and JSON, which are the file types I parse the most.

[1] https://github.com/cube2222/octosql/

OctoSQL[0] or DuckDB[1] will most likely be much simpler, while going through 10 GB of JSON in a couple of seconds at most.

Disclaimer: author of OctoSQL

[0]: https://github.com/cube2222/octosql

[1]: https://duckdb.org/

Very interesting, sounds kinda like a GraphQL counterpart to https://github.com/cube2222/octosql
Congrats on the Show HN!

It's great to see more tools in this area (querying data from various sources in-place) and the Lambda use case is a really cool idea!

I've recently done a bunch of benchmarking, including ClickHouse Local, and the usage was straightforward, with everything working as it's supposed to.

Just to comment on the performance avenue though: one area where I think ClickHouse could still improve - vs OctoSQL[0] at least - is the JSON datasource, which seems slower, especially when only a small part of each JSON object is used. If only a single field of many is needed, OctoSQL lazily parses just that field and skips the others, which yields non-trivial performance gains on big JSON files with small queries.

Basically, for a query like `SELECT COUNT(*), AVG(overall) FROM books.json` on the Amazon Review Dataset (10 GB), OctoSQL is twice as fast (3s vs 6s). That's a minor thing, though (OctoSQL will slow down for more complicated queries, while for ClickHouse decoding the input is and remains the bottleneck, with the processing itself being ridiculously fast).

Godspeed with the future evolution of ClickHouse!

[0]: https://github.com/cube2222/octosql

This is really impressive!

I've been following Materialize, as their blog posts are a great source of inspiration for my work on OctoSQL[0] (a CLI SQL dataflow engine). I was a bit surprised by how few data sources they supported (basically Kafka and Postgres, based on their docs), but now that they're switching/pivoting to being a database themselves, this makes much more sense.

I also think the architecture is really cool. Cloud-native is the way to go for modern databases and will make adoption much easier than something you'd have to host on bare metal. One question though: does this mean the open-source version is basically deprecated now, with further development being closed-source, or does the open-source project represent the "compute" part of the "next gen Materialize"?

Congrats and good luck with further development!

[0]: https://github.com/cube2222/octosql

To add somewhat of a counterpoint to the other response: I've tried the Steampipe CSV plugin and got performance 50x slower than OctoSQL[0], which is itself 5x slower than something like DataFusion[1]. The CSV plugin doesn't contact any external APIs, so it should be a good benchmark of the plugin architecture, though it might just not be optimized yet.

That said, I don't imagine this ever being a bottleneck for the main use case of Steampipe - there, I think the APIs themselves will always be the limiting factor. But it does - potentially - speak to what you can expect if you'd like to extend your usage of Steampipe to more than just DevOps data.

I've used the benchmark available in the OctoSQL README.

[0]: https://github.com/cube2222/octosql

[1]: https://github.com/apache/arrow-datafusion

Disclaimer: author of OctoSQL