What does HackerNews think of dsq?

Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.

Language: Go

#29 in Go
#10 in JSON
#18 in SQL
Many of the tools shared in this thread simplify working with CSV files, but only some allow running proper SQL queries.

SQLite, DuckDB and Clickhouse-local have been mentioned, but another very simple one, a single dependency-free binary, is https://github.com/multiprocessio/dsq

Not affiliated, just a happy user

OctoSQL is an awesome project and Kuba has a lot of great experience to share from building this project I'm excited to learn from.

And while building a custom database engine does allow you to do pretty quick queries, there are a few issues.

First, the SQL implemented is nonstandard. As I was looking for documentation and it pointed me to `SELECT * FROM docs.functions fs`. I tried to count the number of functions but octosql crashed (a Go panic) when I ran `SELECT count(1) FROM docs.functions fs` and `SELECT count(*) FROM docs.functions fs` which is what I lazily do in standard SQL databases. (`SELECT count(fs.name) FROM docs.function fs` worked.)

This kind of thing will keep happening because this project just doesn't have as much resources today as SQLite, Postgres, DuckDB, etc. It will support a limited subset of SQL.

Second, the standard library seems pretty small. When I counted the builtin functions there were only 29. Now this is an easy thing to rectify over time but just noting about the state today.

And third this project only has builtin support for querying CSV and JSON files. Again this could be easy to rectify over time but just mentioning the state today.

octosql is a great project but there are also different ways to do the same thing.

I build dsq [0] which runs all queries through SQLite so it avoids point 1. It has access to SQLite's standard builtin functions plus a battery of extra statistic aggregation, string manipulation, url manipulation, date manipulation, hashing, and math functions custom built to help this kind of interactive querying developers commonly do [1].

And dsq supports not just CSV and JSON but parquet, excel, ODS, ORC, YAML, TSV, and Apache and nginx logs.

A downside to dsq is that it is slower for large files (say over 10GB) when you only want a few columns whereas octosql does better in some of those cases. I'm hoping to improve this over time by adding a SQL filtering frontend to dsq but in all cases dsq will ultimately use SQLite as the query engine.

You can find more info about similar projects in octosql's Benchmark section but I also have a comparison section in dsq [2] and an extension of the octosql benchmark with different set of tools [3] including duckdb.

Everyone should check out duckdb. :)

[0] https://github.com/multiprocessio/dsq

[1] https://github.com/multiprocessio/go-sqlite3-stdlib

[2] https://github.com/multiprocessio/dsq#comparisons

[3] https://github.com/multiprocessio/dsq#benchmark*

This is a cool project! But if you just want a simple CLI, you can query Excel and ODS files with dsq [0]. It's the same end result without needing to carry plugins around and with support for many additional data formats. Plus a growing standard library of functions that don't come built into SQLite such as best-effort date parsing, URL parsing/extraction, statistical aggregation functions, math functions, string and regex helpers, hashing functions and so on [1].

An annoying thing about this extension-based style of file support is needing to create a new table for every new file if the schema is different. This is a limitation [2] of sqlite unfortunately. dsq doesn't work this way so it doesn't have that limit.

On the other hand, if you go this route you can more easily combine with other extensions. That's not really possible with dsq right now.

[0] https://github.com/multiprocessio/dsq

[1] https://github.com/multiprocessio/go-sqlite3-stdlib

[2] https://sqlite.org/forum/forumpost/ec944414fa

Hey folks! I quit my job at Oracle almost a year ago now to build DataStation. It's an app I've wanted as an engineering manager for years. It's entirely open-source and while I've had a few awesome contributors I'm mostly the only person on it. It has been funded out of contract development and savings.

DataStation helps you query a variety of data sources (conventional SQL like PostgreSQL and MySQL, non-SQL like Prometheus or Elasticsearch), files and HTTP APIs. It is not a SQL layer on top of these various APIs like FDW in Postgres or Apache Calcite.

DataStation just tries to abstract away glue code. So in DataStation for Prometheus you query with PromQL. For Elasticsearch you query with Lucene. And for SQL databases you query with their SQL dialect. But you don't need to remember how to use the appropriate library for your language. You just need your own credentials.

DataStation is made of panels (other apps might call them cells) that each produce a result. Panels can refer to other panels. These allow you to build workflows that cross the boundary of a particular datasource. For example you might have some data in a CSV a product manager gave you and the bulk of your data is in PostgreSQL. In DataStation you could pull in the CSV with a File panel and pull in the Postgres data with a Database panel. Then you can join both panel results in a Code panel using your favorite language like Python, Ruby, R, Node, Julia, etc. You can even script Code panels in a SQLite dialect with a bunch of rich addons (url parsing, best-effort date parsing, statistics aggregation, etc.): https://github.com/multiprocessio/go-sqlite3-stdlib.

You can watch a simple introductory video: https://www.youtube.com/watch?v=q_jRBvbwIzU. Or if you want to see that cross-datasource interaction taken to an extreme, check out this video using Postgres metadata to filter log data in Elasticsearch to do historic request analysis on a subset of customers: https://www.youtube.com/watch?v=tIh99YVHoRE.

DataStation is mainly a desktop app today where the end result is that you export graph SVGs or HTML tables or markdown tables or just a CSV file. All this data stays on your laptop so it's as easy to use in a corporate environment as any existing SQL IDE or Jupyter Notebook.

In the last year it's reached 1.5k stars on Github, over 1000 unique users and currently on-average about 40 fairly active users per month (defined as having opened the app more than a few times).

Since it's only just now 12 months old it's been going through a lot of maturing during this time. If you've tried it before and it was buggy or too slow it's probably worth another try now if you're still interested.

DataStation is primarily an Electron app but the code that evaluates panels is written in Go. The Go evaluation code forms the backbone of another app you may have seen around HN, dsq: https://github.com/multiprocessio/dsq, which is a limited version of DataStation as a CLI for querying files with SQL.

In the future I'd like to see more people using it as a server app where my goal is to support read-only dashboards and recurring exports. That part is still work-in-progress.

You can find a ton of tutorials on how to interact with supported databases on the DataStation website: https://datastation.multiprocess.io/docs/.

Looking forward to your feedback!

I use Go heavily cross-platform developing DataStation [0] and dsq [1]. I am not an expert. And I don't have proof for it but on some rudimentary benchmarks the Linux-specific file idioms in the Go standard library definitely don't seem to translate well to even macOS let alone Windows. For example some good streaming techniques for reading large files on Linux that work really well there seemed to be pretty bad on macOS.

I think Amos has presented more proof than I can on the topic of just how Linux-influenced Go is. And I think it is fine for the majority of Go users because the majority users of Go are building server apps or Linux CLIs.

Amos has spent some time building cross-platform desktop systems with Go for itch.io and I think I'm seeing some of the same things they are in that scenario.

I think this is a reasonable article. I didn't notice anything too flame-y myself but if for you Amos gets flame-y at any point I think that's worth ignoring because there does seem to be something up with Go in cross-platform applications. Amos does have good experience here. (Go look at itchi.io's github like wharf [2] or butler [3] where they are/were the main contributor.)

I like Go a lot and for most things I'd keep using it still. Just sharing some observations.

[0] https://github.com/multiprocessio/datastation

[1] https://github.com/multiprocessio/dsq

[2] https://github.com/itchio/wharf

[3] https://github.com/itchio/butler

If jq is getting too slow for you (that's never happened for me), it really seems like it's time to put your data in a database like sqlite or duckdb at least.

Incidentally there are many tools that help you do this like dsq [0] (which I develop), q [1], textql [2], etc.

[0] https://github.com/multiprocessio/dsq

[1] https://github.com/harelba/q

[2] https://github.com/dinedal/textql

In addition to jq (already mentioned) here are some other useful CLI tools for dealing with JSON data. Descriptions are directly from GitHub. Note: dsq also has a companion GUI app called DataStation[0] if you're looking for that.

fx: Command-line JSON processing tool - https://github.com/antonmedv/fx

dasel: JSON, YAML, TOML, XML, and CSV query and modification tool - https://github.com/TomWright/dasel

dsq: CLI tool for running SQL queries against JSON, CSV, Excel, Parquet, and more - https://github.com/multiprocessio/dsq

gron: Make JSON greppable - https://github.com/tomnomnom/gron

jello: Filter JSON and JSON Lines data with Python syntax - https://github.com/kellyjonbrazil/jello

jless: Command-line pager for JSON data - https://github.com/PaulJuliusMartinez/jless

jid: Json incremental digger - https://github.com/simeji/jid

jql: JSON query language CLI tool - https://github.com/yamafaktory/jql

qp: Command-line (ND)JSON querying - https://github.com/paybase/qp

[0]: https://github.com/multiprocessio/datastation

Nice work! I have a similar tool built on SQLite [0] and there are a number of similar tools (compared in my readme) like octosql, q, textql, datasette, etc.

Implementation-wise, is there a reason you wrote your own SQL parser rather than using an existing library? I ask not because I'd discourage handwritten SQL parsers but that I think the particular approach you took is going to suffer when it comes to error messages and lexical edge cases like `SELECT(SELECT 1)`.

Not to mention that it introduces a completely new SQL dialect users would have to learn. (I think even SQLite is a bit risky since most people know MySQL or PostgreSQL dialects.)

But I can appreciate you wanting your own syntax especially for nested objects. In dsq I solved that by giving nested fields a literal column name with a dot in it. But that means you have to quote the column name (e.g. `SELECT "city.address" FROM {}`).

The additional penalty you'd pay not backing this by SQLite or DuckDB or some other in-memory database is in query performance as the input grows. Among other existing tools like this octosql has its own query engine but cube2222 has put a lot of time into optimizing it and it's also written in Go.

Whatever the case, competition is good! Happy to chat more if you want. I love database projects!

[0] https://github.com/multiprocessio/dsq

I shared the first version of this in a Show HN last month [0]. The big update since then is that it now supports loading multiple files and doing SQL joins on them. You can see examples of that in this post.

The repo is here [1] and the README has a comparison against some of the other great tools in the space like q, octosql, and textql.

[0] https://news.ycombinator.com/item?id=29643835

[1] https://github.com/multiprocessio/dsq