What does HackerNews think of dolt?

Dolt – Git for Data

Language: Go

#4 in Database
#7 in Go
#6 in MySQL
#8 in SQL
Interesting that branching is now better supported and almost free. I wonder if merging can be simplified or whether it already is as simple and as fast as it can be?

I guess I am inspired by Dolt’s ability to branch and merge: https://github.com/dolthub/dolt
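
For readers wondering what "merge" involves for tabular data, here is a minimal sketch of a cell-wise three-way merge. It is an illustration under stated assumptions, not Dolt's implementation: tables are modeled as plain dicts from primary key to row, and the name `three_way_merge` is hypothetical.

```python
def three_way_merge(base, ours, theirs):
    """Merge two row maps (primary key -> row dict) against a common ancestor.

    Returns (merged, conflicts). A conflict is a (key, column) pair that both
    sides changed differently; column is None for row-level conflicts.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t or t == b:
            row = o                      # sides agree, or only ours changed
        elif o == b:
            row = t                      # only theirs changed
        elif b is None or o is None or t is None:
            conflicts.append((key, None))  # add/add or delete/modify
            row = o                      # keep our side pending resolution
        else:
            row = {}
            for col in sorted(set(b) | set(o) | set(t)):
                bc, oc, tc = b.get(col), o.get(col), t.get(col)
                if oc == tc or tc == bc:
                    row[col] = oc
                elif oc == bc:
                    row[col] = tc
                else:
                    conflicts.append((key, col))
                    row[col] = oc        # keep our value pending resolution
        if row is not None:
            merged[key] = row
    return merged, conflicts


base = {1: {"name": "a", "qty": 1}, 2: {"name": "b", "qty": 2}}
ours = {1: {"name": "a", "qty": 5}, 2: {"name": "b", "qty": 2}}
theirs = {1: {"name": "A", "qty": 1}, 2: {"name": "b", "qty": 2},
          3: {"name": "c", "qty": 3}}
merged, conflicts = three_way_merge(base, ours, theirs)
```

Edits to different columns of the same row combine cleanly here; only two different edits to the same cell are reported as a conflict.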

Very cool!

Have you heard about dolt, which is also "Git for Data"?

https://github.com/dolthub/dolt

We also built dolthub, which is like github for dolt databases:

https://www.dolthub.com/

If you are just looking for data versioning there is Dolt:

https://github.com/dolthub/dolt

And that has a user-friendly UI in DoltHub:

https://www.dolthub.com/

You wouldn't store the images themselves in Dolt, those would likely be links to S3, but all the labels and surrounding metadata could be stored in Dolt?

DISCLAIMER: I'm the CEO of DoltHub so this is self-promotion.

Founder of DoltHub here. One of my team pointed me at this thread. Congrats on the launch. Great to see more folks tackling the data versioning problem.

Dolt hasn't come up here yet, probably because we're focused on OLTP use cases, not MLOps, but we do have some customers using Dolt as the backing store for their training data.

https://github.com/dolthub/dolt

Dolt also scales to the 1TB range and offers you full SQL query capabilities on your data and diffs.
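
To make "queries on your diffs" concrete, here is a rough plain-Python illustration of what a row-level diff between two versions of a label table contains. Dolt exposes diffs through SQL; this sketch does not use or reproduce Dolt's API, and `diff_rows` and the sample rows are hypothetical.

```python
def diff_rows(old, new):
    """Compare two snapshots (primary key -> row dict) and list the changes.

    Each change is a tuple: (kind, key, old_row, new_row).
    """
    changes = []
    for key in sorted(set(old) | set(new)):
        if key not in old:
            changes.append(("added", key, None, new[key]))
        elif key not in new:
            changes.append(("removed", key, old[key], None))
        elif old[key] != new[key]:
            changes.append(("modified", key, old[key], new[key]))
    return changes


# Two versions of a tiny training-label table.
before = {1: {"label": "cat"}, 2: {"label": "dog"}}
after = {1: {"label": "cat"}, 2: {"label": "wolf"}, 3: {"label": "bird"}}
changes = diff_rows(before, after)
```

Because the result is itself just rows, it can be filtered and aggregated like any other table, which is the appeal of having diffs queryable.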

Versioned pages backed by content-addressed store and transactions over the page index rather than pages! That totally makes sense to me.

Before you managed to produce mvsqlite, I was wondering if it is possible to rebase the https://github.com/dolthub/dolt content-addressed page store (implemented with ProllyTree over a standard OS FS) onto FDB, so that there will be a MySQL-compat DB with similar properties to mvsqlite, where actual page updates can be done outside FDB transactions to overcome the 5-second FDB limit. Apparently, you already materialized a similar idea in a more sophisticated, practical, and complete way.

Thanks a lot for clarifying and keep up the great work. Your work is totally awesome!
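
The "versioned pages backed by a content-addressed store" idea in the comment above can be sketched roughly as follows. This is a toy under loud assumptions: it is neither mvsqlite's nor Dolt's code, the page index is a flat dict rather than a Prolly tree, and `PageStore` with its methods is a hypothetical name.

```python
import hashlib


class PageStore:
    """Toy content-addressed store: pages are keyed by the SHA-256 of their bytes."""

    def __init__(self):
        self.blobs = {}     # hash -> page bytes, never modified once written
        self.versions = []  # each version is a page index: page number -> hash

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # identical pages are stored once
        return key

    def commit(self, base, changes):
        """Create a version from `base` (a version number or None) plus changed pages."""
        index = dict(self.versions[base]) if base is not None else {}
        for page_no, data in changes.items():
            index[page_no] = self.put(data)
        self.versions.append(index)  # only the small index is new per version
        return len(self.versions) - 1

    def read(self, version, page_no):
        return self.blobs[self.versions[version][page_no]]


store = PageStore()
v0 = store.commit(None, {0: b"header", 1: b"rows-a"})
v1 = store.commit(v0, {1: b"rows-b"})  # page 0 is shared, not copied
```

Writing page bytes (`put`) never touches existing data, so in this model only the index update in `commit` would need to be transactional, which is the property the comment is pointing at.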

Sweet. How tightly is this coupled to SQLite? I’d like to embed Dolt [0] instead, specifically for the Dolthub-backed collaboration model.

[0] https://github.com/dolthub/dolt

dolthub/dolt https://github.com/dolthub/dolt:

  dolt clone
  dolt pull
  dolt push
  dolt checkout
  dolt branch
  dolt commit
  dolt merge

  dolt blame
  dolt diff

mgramin/awesome-db-tools > schema > changes: https://github.com/mgramin/awesome-db-tools#changes

EthicalML/awesome-production-machine-learning#model-and-data-versioning: https://github.com/EthicalML/awesome-production-machine-lear...

There are a few git-inspired version-controlled databases out there if performance becomes an issue. Dolt and TerminusDB are the most prominent.

https://github.com/terminusdb/terminusdb https://github.com/dolthub/dolt

You should check out dolt. It does exactly what you're describing, and is a drop-in MySQL replacement:

https://github.com/dolthub/dolt

I’ve been following the progress of Dolt [1] which is a SQL database that works like git. This would give you modification history in a similar way to git. That’s different from recording when events happened, though (and changing your mind about when they happened), so you’ll still need timestamps for that.

[1] https://github.com/dolthub/dolt

I've been very curious to explore this type of use case with askgit (https://github.com/augmentable-dev/askgit), which was designed for running simple "slice and dice" queries and aggregations on git history (and change stats) for basic analytical purposes. I've been curious about how this could be applied to a small text+git based "db". Say, for regular JSON or CSV dumps.

This also reminds me of Dolt: https://github.com/dolthub/dolt which I believe has been on HN a couple times

Or for a SQL database with Git versioning semantics:

https://github.com/dolthub/dolt

I think this is an excellent architecture for powerful, respectful, hosted applications. I’ve been thinking about a few extensions of this idea:

First, use advances in privacy technology to create a service-wide data warehouse that has enough information to help you make good decisions without exposing any specific user’s data. Done properly, users will benefit from your improved decision-making without giving up their personal data. Differential Privacy can do this.

Second, give users the opportunity to download their own little database in native format (e.g. SQLite). This is the ultimate in data portability. I think Dolt [0] might be good for this, because its git-like approach gives you push/pull syncing as well as diffing. That would make it easy for users to keep a local copy of the data up to date.

Third, you can start to support self-hosting and perhaps even open-source the primary user-facing application. The hosted service sells convenience and features enabled by the privacy-respecting data warehouse.

The big questions, of course, are many:

- Would users pay for this?

- Does increased development cost and reduced velocity outweigh the privacy benefits?

- Would the open-source component enable clones that undermine your business, or attract new users who may eventually upgrade to your paid service?

I would like to find out the answers!

[0] https://github.com/dolthub/dolt

People that are interested in a similar feature set should check out https://github.com/attic-labs/noms and the SQL fork of Noms, https://github.com/dolthub/dolt

> HTML/React inspired UI library that works on all platforms, so we can do electron without wasting 99% of my CPU cycles.

There are React Native forks for Windows, MacOS and Linux. I have no idea whether any of them is "good implementation" though.

> SQL + realtime computed views (eg materialize)

ClickHouse (OLAP DB) has materialized views (but only for inserts). Also Oracle and (I guess!) Materialize DB should have it too.

> Desktop apps that can be run without needing to be installed. (Like websites, but with native code.)

AppImage (and maybe Snap and Flatpak) is like this. Also technically, with Nix you can just run something like

    nix-shell -p chromium --command chromium

(without root), but it feels like cheating.

> Git but for data

https://github.com/dolthub/dolt (again, I've never tried it, but would like to in the future)

Dolthub | Los Angeles, CA | Full Time | REMOTE (ONSITE WHEN SAFE) | Full-Stack Engineers

Dolthub is a Series A startup ($7m raised) building git for data. Our core project is an open-source database called dolt (https://github.com/dolthub/dolt) that allows git-like version control for databases (branches and merges). Dolthub (https://www.dolthub.com/) is built on top of the Dolt datastore and is seeking to build a datahub that allows open-source data collaboration at scale.

We are looking for mid to senior level devs who are passionate about scaling Dolthub and making it the de facto platform for data collaboration. Some work may include getting datasets onto Dolt via ETL/Airflow jobs or even working on the core Dolt DB technology. Our team is filled with rockstar distinguished/senior engineers from places like Amazon, Snapchat, and Google. You will learn a lot.

Stack: Go, Node.js, Next.js, React.js

See full jd here: https://www.linkedin.com/jobs/view/2166939708/?refId=4890099...

Questions: Reply to this thread or shoot an email to [email protected]

Unfortunately, we cannot sponsor H1Bs or F-1 OPT candidates.