What does HackerNews think of datasette?

An open source multi-tool for exploring and publishing data

Language: Python

#5 in Docker

#1 in JSON

#10 in Python

#4 in SQL

Implement Q&A against your documentation with GPT3 | Jan 2023

Opensource: https://github.com/simonw/datasette

Ask HN: What's the best way to create a database for legal document clauses? | Aug 2022

I would recommend SQLite + a nice usable interface tool like

datasette

https://github.com/simonw/datasette

Or SQLitebrowser https://sqlitebrowser.org/

(There are other generalized tools, search around to see if there are more suitable.)

You can store whole files, in addition to searchable text fields , and the single file is easier to share or turn into a more sophisticated “app” when the time is right.

Architecture Notes: Datasette | May 2022

Expand Context ↕

Opened an issue exploring alternatives here: https://github.com/simonw/datasette.io/issues/109

I decided to just drop "any size" but keep "any shape".

The Datasette Ecosystem | Apr 2022

Expand Context ↕

Interesting - I hadn't thought about putting that right on the homepage. The GitHub README has that: https://github.com/simonw/datasette

I worry that my target audience for the homepage won't necessarily have access to a working Python and pip, and so won't be able to just run that command without additional guidance (see https://docs.datasette.io/en/stable/installation.html )

That's why I emphasize trying out a hosted demo on the homepage instead.

The fastest tool for querying large JSON files is written in Python (benchmark) | Apr 2022

Expand Context ↕

  > * Which dialect for querying worked the best for you?

If this is SQL dialects, I am partial to both the Postgres JSON/JSONB operators, SQLite has a solid JSON implementation too.

  > * Which tools do you enjoy the most? This could be for any reason: convenience of install/availability, expressivity, raw speed, other niche features, etc.

"Datasette" (from Django co-creator) can take tabular data (SQLite, CSV, JSON, etc) and generate a REST/GraphQL API with visualization tools from it:

https://github.com/simonw/datasette

From the same author, "sqlite-utils" generate SQLite table definitions and rows from similar:

https://github.com/simonw/sqlite-utils

I find this useful outside of SQLite because the syntax is similar across SQL databases. Great way to bootstrap a DB definition from existing datasets.

"Pipe JSON (or CSV or TSV) directly into a new SQLite database file, automatically creating a table with the appropriate schema"

  > * What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, object-of-arrays, array-of-objects, or in an SQL context you could have object-per-row or object-of-arrays-as-table, etc). I'd love to hear about others that are important to you.

For file, either JSONL/NDJSON (JSON object per line) if it's large, or a single array of objects if it's small.

In a SQL context, object-per-row undoubtedly

  > * What do you feel is the biggest gap/missing feature in currently available tools?

Don't have anything useful on this one unfortunately, but I'm not an analytics/data science person =(

Roapi: An API Server for Static Datasets | Oct 2021

What's the difference between this, and Datasette which has been in development multiple years by the Django co-creator?

Sorry if it's a dumb question but I skimmed the docs and sample and it looks similar + didn't see any mention:

https://github.com/simonw/datasette

JupyterLab Desktop App now available | Sep 2021

I'm working on an open-source desktop app [0] similar to Jupyter but more oriented toward non-data-science developer workflows like querying/joining data from multiple databases and scripting and graphing for internal reporting.

Another similar tool, but maybe more data science focused, is Simon Willison's Datasette [1].

[0] https://github.com/multiprocessio/datastation

[1] https://github.com/simonw/datasette

Schema on write is better to live by | Aug 2021

I've come around to almost the opposite approach.

I pull all of the data I can get my hands on (from Twitter, GitHub, Swarm, Apple Health, Pocket, Apple Photos and more) into SQLite database tables that match the schema of the system that they are imported from. I call this family of scripts Dogsheep - https://dogsheep.github.io/

For my own personal Dogsheep (https://simonwillison.net/2020/Nov/14/personal-data-warehous...) that's 119 tables right now.

Then I use SQL queries against those tables to extract and combine data in ways that are useful to me.

If the schema of the systems I am importing from changes, I can update my queries to compensate for the change.

This protects me from having to solve for a standard schema up front - I take whatever those systems give me. But it lets me combine and search across all of the data from disparate systems essentially at runtime.

I even have a search engine for this, which is populated by SQL queries against the different source tables. You can see an example of how that works at https://github.com/simonw/datasette.io/blob/main/templates/d... - which powers the search interface at https://datasette.io/-/beta

A Unix-style personal search engine and web crawler for your digital footprint | Jul 2021

My version of this is https://dogsheep.github.io/ - the idea is to pull your digital footprint from various different sources (Twitter, Foursquare, GitHub etc) into SQLite database files, then run Datasette on top to explore them.

On top of that I built a search engine called Dogsheep Beta which builds a full-text search index across all of the different sources and lets you search in one place: https://github.com/dogsheep/dogsheep-beta

You can see a live demonstration of that search engine on the Datasette website: https://datasette.io/-/beta?q=dogsheep

The key difference I see with Apollo is that Dogsheep separates fetching of data from search and indexing, and uses SQLite as the storage format. I'm using a YAML configuration to define how the search index should work: https://github.com/simonw/datasette.io/blob/main/templates/d... - it defines SQL queries that can be used to build the index from other tables, plus HTML fragments for how those results should be displayed.

Using sqlite3 as a notekeeping document graph | Jul 2021

Expand Context ↕

I've been exploring this technique more over the past year and I really like it - https://datasette.io (code at https://github.com/simonw/datasette.io ) is a more recent and much more complicated example.

Extracting links from markdown and using them to populate some additional columns or tables at build time would be pretty straight forward.

Hosting SQLite databases on GitHub Pages (or any static file hoster) | May 2021

Expand Context ↕

This looks pretty efficient. Some chains can be interacted with without e.g. web3.js? LevelDB indexes aren't SQLite.

Datasette is one application for views of read-only SQLite dbs with out-of-band replication. https://github.com/simonw/datasette

There are a bunch of *-to-sqlite utilities in corresponding dogsheep project.

Arrow JS for 'paged' browser client access to DuckDB might be possible and faster but without full SQLite SQL compatibility and the SQLite test suite. https://arrow.apache.org/docs/js/

https://duckdb.org/ :

> Direct Parquet & CSV querying

In-browser notebooks like Pyodide and Jyve have local filesystem access with the new "Filesystem Access API", but downloading/copying all data to the browser for every run of a browser-hosted notebook may not be necessary. https://web.dev/file-system-access/

Ask HN: What novel tools are you using to write web sites/apps? | Apr 2021

I've been using my Datasette tool, originally intended for exploring databases, to build full-blown websites and it's working surprisingly well. It supports Jinja templates and runs on SQLite databases so it's actually a good fit for semi-static sites.

Here's what Datasette looks like by default:

- https://latest.datasette.io

And here are three websites that are actually just Datasette with some custom templates and plugins:

- https://datasette.io (data: https://datasette.io/content code: https://github.com/simonw/datasette.io )

- https://til.simonwillison.net (data: https://til.simonwillison.net/tils code: https://github.com/simonw/til )

- https://www.niche-museums.com (data: https://www.niche-museums.com/browse code: https://github.com/simonw/museums )

They're all hosted on either Vercel or Google Cloud Run at the cost of no more than a few dollars a month.

SQLite is not a toy database | Mar 2021

Expand Context ↕

Not "in lieu", but I have found SQLite to be a good companion to pandas. pandas is good when you want to lend some easily navigable structure to your data without leaving python, but becomes cumbersome (it feels cumbersome esp if you know SQL) when you want to do some serious processing. At such times, I just dump my data into a SQLite DB, and from then on my code has SQL queries to it, which I find are far more readable than corresponding pandas statements (also easier for the reviewer).

Another advantage you get is if you wanted to look at these intermediate data, you don't need to run your code in debug mode and view the dataframe at a breakpoint - you can use something like datasette[1] or a standard SQLite DB viewer.

So a function that does complex data processing has approximately this structure my code:

  (1) 
  (2) 
  (3) 
  (4)

Step (3) used to be pandas for me before, but depending on how complex your operations are this can become hard to read and/or review.

[1] https://github.com/simonw/datasette - datasette doesnt replace a standard DB IDE, but is a very good lightweight alternative to one if don't intend to perform updates/inserts directly on a table.

SQLite is not a toy database | Mar 2021

With no sense of overstatement here, SQLite is one of my favorite creations in the entire world, so I have a bunch of links some of you might find interesting if you want to dig further:

https://github.com/sql-js/sql.js - SQL.js lets you run SQLite within a Web page as it's just SQLite compiled to JS with Emscripten.

https://litestream.io/blog/why-i-built-litestream/ - Litestream is a SQLite-powered streaming replication system.

https://sqlite.org/lang_with.html#rcex3 - you can do graph-style queries against SQLite too (briefly mentioned in the article).

https://github.com/aergoio/aergolite - AergoLite is replicated SQLite but secured by a blockchain.

https://github.com/simonw/datasette - Datasette (mentioned at the very end of the OP article) is a tool for offering up an SQLite database as a Web accessible service - you can do queries, data analysis, etc. on top of it. I believe Simon, the creator, frequents HN too and is a true SQLite power user :-)

https://dogsheep.github.io/ - Dogsheep is a whole roster of tools for doing personal analytics (e.g. analyzing your GitHub or Twitter use, say) using SQLite and Datasette.

What's New in SQLite 3.35 | Mar 2021

Expand Context ↕

I run SQLite in serverless environments (Cloud Run, Vercel, Heroku) for dozens of projects... but the trick is that they all treat the database as a read-only asset.

If I want to deploy updated data, I build a brand new image and deploy the application bundled with the data. I tend to run the deploys for these (including the database build) in GitHub Actions workflows.

This works really well, but only for applications that don't need to apply updates more than a few times an hour! If you have a constant stream of updates I still think you're better off using a hosted database like Heroku PostgreSQL or Google Cloud SQL.

One example of a site I deploy like that is https://datasette.io/ - it's built and deployed by this GitHub Actions workflow here: https://github.com/simonw/datasette.io/blob/main/.github/wor...

Datasette: An open source multi-tool for exploring and publishing data | Dec 2020

This is the new Datasette project website - the open source project is three years old now but I launched this website yesterday. Up until now there's just been the GitHub repo at https://github.com/simonw/datasette and the project documentation at https://docs.datasette.io/

The https://datasette.io/ site is running on Datasette itself, with a bunch of custom templates. If you're interested in seeing how it works the source code (and the GitHub Actions that build and deploy the underlying database) can be found here: https://github.com/simonw/datasette.io

Datasette: An open source multi-tool for exploring and publishing data | Dec 2020

Expand Context ↕

It's accepting issues and PRs: https://github.com/simonw/datasette

SQL.js: SQLite Compiled to JavaScript | Nov 2020

Expand Context ↕

I've actually prototyped this recently, but never got around to packaging it up into a library.

There are a few neat adjacent things that it can be used for (e.g. opening big SQLite files on disk without reading it all to memory), building something like Datasette (https://github.com/simonw/datasette) that can run queries on data hosted as static files, or for (as you suggested) using SQLite in a browser with persistence.

For that particular use case, there's a bit of complexity related to mutexes and stuff in trying to prevent simultaneous browser tabs doing write operations from corrupting the database.

Ask HN: Good tool to build a web interface on top of SQL Data | Oct 2020

For a quick web interface, I’ve used this on top of sqlite and it’s fantastic:

https://github.com/simonw/datasette

DuckDB – An embeddable SQL database like SQLite, but supports Postgres features | Sep 2020

A giant plant as in someone from their team planning this out and then planting this as a comment..?

simonw wrote Datasette[1] which makes extensive use of SQLite and acts as a viewer, allowing you to create interactive websites and APIs from arbitrary databases. He'd be a very long term plant and it instead seems far more likely he's interested in the possibilities of DuckDB compatibility within the context of Datasette and other similar projects.

[1]: https://github.com/simonw/datasette

Show HN: Octo – Generate a Serverless API from an SQL Query Connected to Any DB | Sep 2020

Reminds me of the venerable Datasette by Simon Willison: https://github.com/simonw/datasette

Show HN: Dropbase 2.0 – Turn offline files into live databases | Aug 2020

My favorite incarnation of this idea is probably Datasette.

https://github.com/simonw/datasette

We Can Do Better Than SQL | Aug 2020

Expand Context ↕

I've been experimenting with SQL as an API language - including client-side SQL constructed in JavaScript - fir a couple of years with my Datasette project.

I'm using similar security tricks to you: read-only queries with a time limit, against SQLite rather than PostgreSQL.

More here: https://simonwillison.net/2018/Oct/4/datasette-ideas/ and https://github.com/simonw/datasette

Social Movements Are Pushing Google Sheets to the Breaking Point | Aug 2020

Expand Context ↕

Agreed. I have a hunch that any solution to this problem will look similar to https://github.com/simonw/datasette if it develops. Tied together with https://supabase.io/ and you've built a versatile, powerful, open data collaboration engine.

Static Sites with Elasticsearch | May 2020

I've been experimenting with SQLite FTS as a way of adding search to an otherwise static site.

The big advantage of SQLite FTS is that it's really cheap to run. The index is a single static file on disk, then you add a Python process (I'm using https://github.com/simonw/datasette ) to run queries against it. Much less resource intensive than running Solr or Elasticsearch.

It also works surprisingly well - I've run FTS queries against tables that are up to around 10GB on disk and performance is great.

It's no way near as featureful as Lucene, but for small to medium sized projects it's easily good enough.

As for deployment: if the SQLite .db index file is small enough you can bundle it up as part of a static deployment, e.g. bundled in a Docker container. I've done this using Heroku, Google Cloud Run, https://fly.io/ and Zeit Now (aka Vercel).

If the content lives in a git repository you can hook up CI (or a GitHub Action) to build and publish a new copy of the SQLite index on every change.

I've started thinking of this pattern as a kind of static-dynamic site: there's dynamic server-side code but it's running in read-only containers, so you can scale it up by running more copies and if anything goes wrong you just restart the container.

https://til.simonwillison.net/ is my most recent site to use this pattern, see https://github.com/simonw/til for how it works.

I also wrote this tutorial describing the pattern a while ago: https://24ways.org/2018/fast-autocomplete-search-for-your-we...

GitHub uses Datasette to publish an SQL-enabled interface of Covid-19 data | Apr 2020

For people who don't know what Datasette[0][1] is, it's a tool for exploring and publishing data, based on SQLite.

From the README:

> Datasette helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world.

[0] https://github.com/simonw/datasette

[1] http://datasette.readthedocs.io/

Facebook's Download-Your-Data Tool Is Incomplete | Mar 2020

Expand Context ↕

I've been building tools for this at https://github.com/dogsheep

The unifying idea is to convert data dumps from these kinds of companies into SQLite databases, then query and visualize them using https://github.com/simonw/datasette and https://github.com/simonw/datasette-vega

Show HN: SimpleSQL, SQLite Database Reachable over HTTP | Feb 2020

I like it! I've often wanted instant ephemeral DBs for testing and think SQLite is an excellent choice for it.

It might be nice to have a simple GUI for each DB. Perhaps leverage something like https://github.com/simonw/datasette. Also a few sample DBs. Cool stuff!

GNU Recutils | Jan 2020

I've been experimenting recently with YAML for this kind of thing, and it's working out really well for me so far.

I have a tool called yaml-to-sqlite ( https://github.com/simonw/yaml-to-sqlite ) which converts a YAML file into a SQLite database, which I can then use with Datasette ( https://github.com/simonw/datasette )

My biggest project with it so far has been my site https://www.niche-museums.com/ - a guide to small and niche museums. The museums themselves live in a single ~100KB YAML file in GitHub: https://github.com/simonw/museums/blob/master/museums.yaml

I have a CI script which builds that YAML file into a SQLite database and deploys it + Datasette + custom templates to https://www.niche-museums.com/

I've been running the site like this for a few months now and I really like it. I love having my content in source control, I find editing the YAML to be reasonably pleasant (I even edit it on my iPhone sometimes using the Working Copy app) and any YAML errors are caught by CI before they are deployed.

Show HN: ESQLate – Build minimum viable admin panels quickly with just SQL | Jan 2020

Expand Context ↕

I didn't want to come of as suggesting you meant any of that, not at all! Just comparing approaches.

I think it was this: https://github.com/simonw/datasette though supports only CSV files.

There's also Apache Drill which works with a lot of data sources, which is philosophically closer to OctoSQL.

Datasette-atom: Define an Atom feed using a custom SQL query | Dec 2019

For those who didn't know Datasette[0] either, it's a tool for exploring data. Apparently it's an API above SQLite.

[0]: https://github.com/simonw/datasette