What does HackerNews think of datasette?

An open source multi-tool for exploring and publishing data

Language: Python

#5 in Docker
#1 in JSON
#10 in Python
#4 in SQL
I would recommend SQLite + a nice, usable interface tool like

Or SQLitebrowser: https://sqlitebrowser.org/

(There are other generalized tools, search around to see if there are more suitable.)

You can store whole files in addition to searchable text fields, and the single file is easier to share or turn into a more sophisticated “app” when the time is right.

Opened an issue exploring alternatives here: https://github.com/simonw/datasette.io/issues/109

I decided to just drop "any size" but keep "any shape".

Interesting - I hadn't thought about putting that right on the homepage. The GitHub README has that: https://github.com/simonw/datasette

I worry that my target audience for the homepage won't necessarily have access to a working Python and pip, and so won't be able to just run that command without additional guidance (see https://docs.datasette.io/en/stable/installation.html )

That's why I emphasize trying out a hosted demo on the homepage instead.

  > * Which dialect for querying worked the best for you?
If this means SQL dialects: I'm partial to the Postgres JSON/JSONB operators, and SQLite has a solid JSON implementation too.
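As a minimal sketch of the SQLite side of that comparison (the sample JSON is made up), SQLite's JSON functions give you path-based extraction much like Postgres's `->`/`->>` operators:

```python
import sqlite3

# SQLite's JSON1 functions are compiled into most modern SQLite builds,
# including the one bundled with CPython.
conn = sqlite3.connect(":memory:")
doc = '{"user": {"name": "alice"}, "tags": ["admin"]}'
row = conn.execute(
    "SELECT json_extract(?, '$.user.name'), json_extract(?, '$.tags[0]')",
    (doc, doc),
).fetchone()
# row == ('alice', 'admin')
```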

  > * Which tools do you enjoy the most? This could be for any reason: convenience of install/availability, expressivity, raw speed, other niche features, etc.
"Datasette" (from the Django co-creator) can take tabular data (SQLite, CSV, JSON, etc.) and generate a REST/GraphQL API with visualization tools from it:


From the same author, "sqlite-utils" generates SQLite table definitions and rows from similar inputs:


I find this useful outside of SQLite because the syntax is similar across SQL databases. Great way to bootstrap a DB definition from existing datasets.

"Pipe JSON (or CSV or TSV) directly into a new SQLite database file, automatically creating a table with the appropriate schema"
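The core idea behind that quoted feature — infer a schema from the records themselves — can be sketched in plain-stdlib Python (`insert_json` is a hypothetical helper, a much simplified stand-in for what sqlite-utils actually does):

```python
import json
import sqlite3

def insert_json(db_path, table, records):
    """Create a table whose columns are inferred from the first record's
    keys, then insert every record. A simplified sketch of the
    sqlite-utils behavior quoted above."""
    conn = sqlite3.connect(db_path)
    cols = list(records[0])
    col_list = ", ".join(f"[{c}]" for c in cols)
    conn.execute(f"CREATE TABLE IF NOT EXISTS [{table}] ({col_list})")
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(
        f"INSERT INTO [{table}] VALUES ({placeholders})",
        [tuple(r.get(c) for c in cols) for r in records],
    )
    conn.commit()
    return conn

records = json.loads('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')
conn = insert_json(":memory:", "items", records)
```

The real tool is a one-liner at the shell, e.g. `cat data.json | sqlite-utils insert data.db items -`.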

  > * What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, object-of-arrays, array-of-objects, or in an SQL context you could have object-per-row or object-of-arrays-as-table, etc). I'd love to hear about others that are important to you.
For files, either JSONL/NDJSON (one JSON object per line) if it's large, or a single array of objects if it's small.

In a SQL context, object-per-row, undoubtedly.
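Handling both of those file meta-formats with one loader is straightforward; here's a small sketch (the sniffing-by-first-character trick is my own simplification):

```python
import io
import json

def load_records(f):
    """Load either a single JSON array or JSONL/NDJSON (one object per
    line), sniffing the format from the first non-space character."""
    text = f.read()
    if text.lstrip()[:1] == "[":
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Both forms produce the same list of records:
array_form = io.StringIO('[{"a": 1}, {"a": 2}]')
jsonl_form = io.StringIO('{"a": 1}\n{"a": 2}\n')
records = load_records(jsonl_form)
# records == [{"a": 1}, {"a": 2}]
```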

  > * What do you feel is the biggest gap/missing feature in currently available tools?
Don't have anything useful on this one unfortunately, but I'm not an analytics/data science person =(

What's the difference between this and Datasette, which has been in development for multiple years by the Django co-creator?

Sorry if it's a dumb question but I skimmed the docs and sample and it looks similar + didn't see any mention:


I'm working on an open-source desktop app [0] similar to Jupyter but more oriented toward non-data-science developer workflows like querying/joining data from multiple databases and scripting and graphing for internal reporting.

Another similar tool, but maybe more data science focused, is Simon Willison's Datasette [1].

[0] https://github.com/multiprocessio/datastation

[1] https://github.com/simonw/datasette

I've come around to almost the opposite approach.

I pull all of the data I can get my hands on (from Twitter, GitHub, Swarm, Apple Health, Pocket, Apple Photos and more) into SQLite database tables that match the schema of the system that they are imported from. I call this family of scripts Dogsheep - https://dogsheep.github.io/

For my own personal Dogsheep (https://simonwillison.net/2020/Nov/14/personal-data-warehous...) that's 119 tables right now.

Then I use SQL queries against those tables to extract and combine data in ways that are useful to me.

If the schema of the systems I am importing from changes, I can update my queries to compensate for the change.

This protects me from having to solve for a standard schema up front - I take whatever those systems give me. But it lets me combine and search across all of the data from disparate systems essentially at runtime.

I even have a search engine for this, which is populated by SQL queries against the different source tables. You can see an example of how that works at https://github.com/simonw/datasette.io/blob/main/templates/d... - which powers the search interface at https://datasette.io/-/beta

My version of this is https://dogsheep.github.io/ - the idea is to pull your digital footprint from various different sources (Twitter, Foursquare, GitHub etc) into SQLite database files, then run Datasette on top to explore them.

On top of that I built a search engine called Dogsheep Beta which builds a full-text search index across all of the different sources and lets you search in one place: https://github.com/dogsheep/dogsheep-beta

You can see a live demonstration of that search engine on the Datasette website: https://datasette.io/-/beta?q=dogsheep

The key difference I see with Apollo is that Dogsheep separates fetching of data from search and indexing, and uses SQLite as the storage format. I'm using a YAML configuration to define how the search index should work: https://github.com/simonw/datasette.io/blob/main/templates/d... - it defines SQL queries that can be used to build the index from other tables, plus HTML fragments for how those results should be displayed.

I've been exploring this technique more over the past year and I really like it - https://datasette.io (code at https://github.com/simonw/datasette.io ) is a more recent and much more complicated example.

Extracting links from markdown and using them to populate some additional columns or tables at build time would be pretty straightforward.
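That link-extraction step might look roughly like this (a deliberately simple sketch — the regex only handles inline `[text](url)` links, not reference-style ones):

```python
import re

# Inline markdown links: [text](url). Ignores reference-style links
# and nested brackets, which is often fine for build-time indexing.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")

def extract_links(markdown):
    """Return (text, url) pairs for every inline link in the document."""
    return LINK_RE.findall(markdown)

links = extract_links(
    "See [Datasette](https://datasette.io) and [docs](https://docs.datasette.io/)."
)
# links == [('Datasette', 'https://datasette.io'),
#           ('docs', 'https://docs.datasette.io/')]
```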

This looks pretty efficient. Some chains can be interacted with without e.g. web3.js? LevelDB indexes aren't SQLite.

Datasette is one application for views of read-only SQLite dbs with out-of-band replication. https://github.com/simonw/datasette

There are a bunch of *-to-sqlite utilities in the corresponding Dogsheep project.

Arrow JS for 'paged' browser client access to DuckDB might be possible and faster but without full SQLite SQL compatibility and the SQLite test suite. https://arrow.apache.org/docs/js/

https://duckdb.org/ :

> Direct Parquet & CSV querying

In-browser notebooks like Pyodide and Jyve have local filesystem access with the new "Filesystem Access API", but downloading/copying all data to the browser for every run of a browser-hosted notebook may not be necessary. https://web.dev/file-system-access/

I've been using my Datasette tool, originally intended for exploring databases, to build full-blown websites and it's working surprisingly well. It supports Jinja templates and runs on SQLite databases so it's actually a good fit for semi-static sites.

Here's what Datasette looks like by default:

- https://latest.datasette.io

And here are three websites that are actually just Datasette with some custom templates and plugins:

- https://datasette.io (data: https://datasette.io/content code: https://github.com/simonw/datasette.io )

- https://til.simonwillison.net (data: https://til.simonwillison.net/tils code: https://github.com/simonw/til )

- https://www.niche-museums.com (data: https://www.niche-museums.com/browse code: https://github.com/simonw/museums )

They're all hosted on either Vercel or Google Cloud Run at the cost of no more than a few dollars a month.

Not "in lieu", but I have found SQLite to be a good companion to pandas. pandas is good when you want to lend some easily navigable structure to your data without leaving Python, but it becomes cumbersome (especially if you know SQL) when you want to do some serious processing. At that point I just dump my data into a SQLite DB, and from then on my code issues SQL queries against it, which I find far more readable than the corresponding pandas statements (and easier for a reviewer).

Another advantage you get is if you wanted to look at these intermediate data, you don't need to run your code in debug mode and view the dataframe at a breakpoint - you can use something like datasette[1] or a standard SQLite DB viewer.

So a function that does complex data processing has approximately this structure in my code:

Step (3) used to be pandas for me, but depending on how complex your operations are this can become hard to read and/or review.

[1] https://github.com/simonw/datasette - datasette doesn't replace a standard DB IDE, but is a very good lightweight alternative to one if you don't intend to perform updates/inserts directly on a table.
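The pandas-to-SQLite hand-off described above can be sketched like this (table and values are made up; with a real file path instead of `:memory:`, the intermediate data becomes inspectable in datasette or any SQLite viewer):

```python
import sqlite3

# The pattern: dump intermediate data into SQLite, then do the heavy
# reshaping with SQL instead of chained pandas operations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)
conn.commit()

# A readable SQL aggregation in place of a pandas groupby chain:
totals = dict(conn.execute(
    "SELECT region, sum(amount) FROM sales GROUP BY region ORDER BY region"
))
# totals == {"east": 150.0, "west": 75.0}
```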

With no sense of overstatement here, SQLite is one of my favorite creations in the entire world, so I have a bunch of links some of you might find interesting if you want to dig further:

https://github.com/sql-js/sql.js - SQL.js lets you run SQLite within a Web page as it's just SQLite compiled to JS with Emscripten.

https://litestream.io/blog/why-i-built-litestream/ - Litestream is a SQLite-powered streaming replication system.

https://sqlite.org/lang_with.html#rcex3 - you can do graph-style queries against SQLite too (briefly mentioned in the article).
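Those graph-style queries use SQLite's `WITH RECURSIVE`; a minimal sketch (edge data is made up) that walks an edge table transitively from a starting node:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (parent TEXT, child TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")],
)

# Everything reachable from node 'a', computed with a recursive CTE:
reachable = [row[0] for row in conn.execute("""
    WITH RECURSIVE under(name) AS (
        VALUES ('a')
        UNION
        SELECT child FROM edges JOIN under ON edges.parent = under.name
    )
    SELECT name FROM under
""")]
# reachable == ['a', 'b', 'c', 'd']  ('x', 'y' are not reachable from 'a')
```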

https://github.com/aergoio/aergolite - AergoLite is replicated SQLite but secured by a blockchain.

https://github.com/simonw/datasette - Datasette (mentioned at the very end of the OP article) is a tool for offering up an SQLite database as a Web accessible service - you can do queries, data analysis, etc. on top of it. I believe Simon, the creator, frequents HN too and is a true SQLite power user :-)

https://dogsheep.github.io/ - Dogsheep is a whole roster of tools for doing personal analytics (e.g. analyzing your GitHub or Twitter use, say) using SQLite and Datasette.

I run SQLite in serverless environments (Cloud Run, Vercel, Heroku) for dozens of projects... but the trick is that they all treat the database as a read-only asset.

If I want to deploy updated data, I build a brand new image and deploy the application bundled with the data. I tend to run the deploys for these (including the database build) in GitHub Actions workflows.

This works really well, but only for applications that don't need to apply updates more than a few times an hour! If you have a constant stream of updates I still think you're better off using a hosted database like Heroku PostgreSQL or Google Cloud SQL.

One example of a site I deploy like that is https://datasette.io/ - it's built and deployed by this GitHub Actions workflow here: https://github.com/simonw/datasette.io/blob/main/.github/wor...
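A workflow of that sort might look roughly like the following (a hypothetical sketch, not the actual datasette.io workflow — the build script, service name, and schedule are illustrative):

```yaml
# Hypothetical CI sketch: rebuild the SQLite database, then deploy it
# bundled with Datasette as a read-only asset.
name: Build and deploy
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 * * * *"   # rebuild at most hourly
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install datasette sqlite-utils
      - run: python build_database.py   # hypothetical build script
      - run: datasette publish cloudrun content.db --service my-site
```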

This is the new Datasette project website - the open source project is three years old now but I launched this website yesterday. Up until now there's just been the GitHub repo at https://github.com/simonw/datasette and the project documentation at https://docs.datasette.io/

The https://datasette.io/ site is running on Datasette itself, with a bunch of custom templates. If you're interested in seeing how it works the source code (and the GitHub Actions that build and deploy the underlying database) can be found here: https://github.com/simonw/datasette.io

I've actually prototyped this recently, but never got around to packaging it up into a library.

There are a few neat adjacent things that it can be used for (e.g. opening big SQLite files on disk without reading it all to memory), building something like Datasette (https://github.com/simonw/datasette) that can run queries on data hosted as static files, or for (as you suggested) using SQLite in a browser with persistence.

For that particular use case, there's a bit of complexity related to mutexes and stuff in trying to prevent simultaneous browser tabs doing write operations from corrupting the database.

For a quick web interface, I’ve used this on top of sqlite and it’s fantastic:


A giant plant, as in someone from their team planning this out and then planting it as a comment...?

simonw wrote Datasette[1], which makes extensive use of SQLite and acts as a viewer, allowing you to create interactive websites and APIs from arbitrary databases. He'd be a very long-term plant; it seems far more likely he's interested in the possibilities of DuckDB compatibility within the context of Datasette and other similar projects.

[1]: https://github.com/simonw/datasette

I've been experimenting with SQL as an API language - including client-side SQL constructed in JavaScript - for a couple of years with my Datasette project.

I'm using security tricks similar to yours: read-only queries with a time limit, against SQLite rather than PostgreSQL.

More here: https://simonwillison.net/2018/Oct/4/datasette-ideas/ and https://github.com/simonw/datasette

Agreed. I have a hunch that any solution to this problem will look similar to https://github.com/simonw/datasette if it develops. Tied together with https://supabase.io/ and you've built a versatile, powerful, open data collaboration engine.

I've been experimenting with SQLite FTS as a way of adding search to an otherwise static site.

The big advantage of SQLite FTS is that it's really cheap to run. The index is a single static file on disk, then you add a Python process (I'm using https://github.com/simonw/datasette ) to run queries against it. Much less resource intensive than running Solr or Elasticsearch.

It also works surprisingly well - I've run FTS queries against tables that are up to around 10GB on disk and performance is great.

It's nowhere near as featureful as Lucene, but for small to medium sized projects it's easily good enough.
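A minimal sketch of that FTS setup (sample documents are made up; FTS5 is compiled into most modern SQLite builds, including CPython's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An FTS5 virtual table indexes its columns for full-text search.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("Datasette", "explore and publish data with SQLite"),
    ("Solr notes", "running a Java search server"),
])

# MATCH runs a full-text query; ORDER BY rank sorts by relevance.
hits = [row[0] for row in conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank", ("sqlite",)
)]
# hits == ['Datasette']
```

Persist the database to a file instead of `:memory:` and that file is exactly the "single static index on disk" described above.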

As for deployment: if the SQLite .db index file is small enough you can bundle it up as part of a static deployment, e.g. bundled in a Docker container. I've done this using Heroku, Google Cloud Run, https://fly.io/ and Zeit Now (aka Vercel).

If the content lives in a git repository you can hook up CI (or a GitHub Action) to build and publish a new copy of the SQLite index on every change.

I've started thinking of this pattern as a kind of static-dynamic site: there's dynamic server-side code but it's running in read-only containers, so you can scale it up by running more copies and if anything goes wrong you just restart the container.

https://til.simonwillison.net/ is my most recent site to use this pattern, see https://github.com/simonw/til for how it works.

I also wrote this tutorial describing the pattern a while ago: https://24ways.org/2018/fast-autocomplete-search-for-your-we...

For people who don't know what Datasette[0][1] is, it's a tool for exploring and publishing data, based on SQLite.

From the README:

> Datasette helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world.

[0] https://github.com/simonw/datasette

[1] http://datasette.readthedocs.io/

I've been building tools for this at https://github.com/dogsheep

The unifying idea is to convert data dumps from these kinds of companies into SQLite databases, then query and visualize them using https://github.com/simonw/datasette and https://github.com/simonw/datasette-vega

I like it! I've often wanted instant ephemeral DBs for testing and think SQLite is an excellent choice for it.

It might be nice to have a simple GUI for each DB. Perhaps leverage something like https://github.com/simonw/datasette. Also a few sample DBs. Cool stuff!

I've been experimenting recently with YAML for this kind of thing, and it's working out really well for me so far.

I have a tool called yaml-to-sqlite ( https://github.com/simonw/yaml-to-sqlite ) which converts a YAML file into a SQLite database, which I can then use with Datasette ( https://github.com/simonw/datasette )

My biggest project with it so far has been my site https://www.niche-museums.com/ - a guide to small and niche museums. The museums themselves live in a single ~100KB YAML file in GitHub: https://github.com/simonw/museums/blob/master/museums.yaml

I have a CI script which builds that YAML file into a SQLite database and deploys it + Datasette + custom templates to https://www.niche-museums.com/

I've been running the site like this for a few months now and I really like it. I love having my content in source control, I find editing the YAML to be reasonably pleasant (I even edit it on my iPhone sometimes using the Working Copy app) and any YAML errors are caught by CI before they are deployed.

I didn't want to come off as suggesting you meant any of that, not at all! Just comparing approaches.

I think it was this: https://github.com/simonw/datasette - though it supports only CSV files.

There's also Apache Drill which works with a lot of data sources, which is philosophically closer to OctoSQL.

For those who didn't know Datasette[0] either, it's a tool for exploring data. Apparently it's an API layer on top of SQLite.

[0]: https://github.com/simonw/datasette