I guess I am inspired by Dolt’s ability to branch and merge: https://github.com/dolthub/dolt
Have you heard about Dolt, which is also "Git for Data"?
https://github.com/dolthub/dolt
We also built DoltHub, which is like GitHub for Dolt databases:
https://github.com/dolthub/dolt
And that has a user-friendly UI in DoltHub:
You wouldn't store the images themselves in Dolt (those would likely be links to S3), but all the labels and surrounding metadata could be stored in Dolt?
DISCLAIMER: I'm the CEO of DoltHub so this is self-promotion.
Dolt hasn't come up here yet, probably because we're focused on OLTP use cases, not MLOps, but we do have some customers using Dolt as the backing store for their training data.
https://github.com/dolthub/dolt
Dolt also scales to the 1TB range and offers you full SQL query capabilities on your data and diffs.
Before you managed to produce mvsqlite, I was wondering whether it would be possible to rebase Dolt's (https://github.com/dolthub/dolt) content-addressed page store (implemented as a Prolly tree over a standard OS filesystem) onto FDB, giving a MySQL-compatible DB with properties similar to mvsqlite, where the actual page updates can be done outside FDB transactions to get around FDB's 5-second transaction limit. Apparently, you have already realized a similar idea in a more sophisticated, practical, and complete way.
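To make the "page updates outside the transaction" idea concrete, here is a minimal sketch of a content-addressed store. It is not Dolt's Prolly tree or mvsqlite's design: `ChunkStore`, `write_version`, and `read_version` are hypothetical names, and a flat manifest stands in for the tree. The point it illustrates is that immutable, hash-keyed pages can be written at any time, and a new version only becomes visible once its root key is published, which is the one step that would need a transaction.

```python
import hashlib


class ChunkStore:
    """Hypothetical in-memory content-addressed store: key = SHA-256 of the bytes."""

    def __init__(self):
        self._chunks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content always hashes to the same key, so it is stored once.
        self._chunks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._chunks[key]

    def __len__(self):
        return len(self._chunks)


def write_version(store, pages):
    """Store every page, then a manifest of page keys; return the manifest's key."""
    page_keys = [store.put(page) for page in pages]
    manifest = "\n".join(page_keys).encode()
    return store.put(manifest)


def read_version(store, root):
    """Resolve a root key back into the list of page contents."""
    manifest = store.get(root).decode()
    return [store.get(key) for key in manifest.split("\n")] if manifest else []


store = ChunkStore()
v1 = write_version(store, [b"page-a", b"page-b", b"page-c"])
v2 = write_version(store, [b"page-a", b"page-B2", b"page-c"])  # one page changed
```

Both versions stay readable through their root keys, and the two unchanged pages are shared between them rather than copied.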
Thanks a lot for clarifying and keep up the great work. Your work is totally awesome!
dolt clone
dolt pull
dolt push
dolt checkout
dolt branch
dolt commit
dolt merge
dolt blame
dolt diff
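For anyone who hasn't used it, the commands above compose the same way their Git counterparts do. Below is a rough sketch of a feature-branch workflow driven from Python. The helper names (`start_branch`, `finish_branch`, `run_all`) are made up for illustration, the base branch name `main` is an assumption, and the table edits themselves (for example through a SQL session) would happen between the two steps.

```python
import shutil
import subprocess


def start_branch(branch):
    """Commands that create and switch to a new branch (hypothetical helper)."""
    return [["dolt", "checkout", "-b", branch]]


def finish_branch(branch, message, base="main"):
    """Commands that commit the working changes and merge them into `base`.

    Assumes tables were edited on `branch` after start_branch() ran,
    and that `base` is the name of the branch to merge into.
    """
    return [
        ["dolt", "add", "."],
        ["dolt", "commit", "-m", message],
        ["dolt", "checkout", base],
        ["dolt", "merge", branch],
    ]


def run_all(commands, cwd="."):
    """Run each command in order inside a Dolt repo; skip if dolt is missing."""
    if shutil.which("dolt") is None:
        return False
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)
    return True


before_edits = start_branch("fix-labels")
after_edits = finish_branch("fix-labels", "Correct mislabeled rows")
```

Nothing is executed until `run_all` is called, so the command lists can be inspected or logged first.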
mgramin/awesome-db-tools > schema > changes: https://github.com/mgramin/awesome-db-tools#changes
EthicalML/awesome-production-machine-learning#model-and-data-versioning: https://github.com/EthicalML/awesome-production-machine-lear...
https://github.com/terminusdb/terminusdb
https://github.com/dolthub/dolt
This also reminds me of Dolt: https://github.com/dolthub/dolt which I believe has been on HN a couple times
First, use advances in privacy technology to create a service-wide data warehouse that has enough information to help you make good decisions without exposing any specific user’s data. Done properly, users will benefit from your improved decision-making without giving up their personal data. Differential Privacy can do this.
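As a rough illustration of what that first step could look like, here is a sketch of the Laplace mechanism, one standard differential-privacy technique, applied to a counting query. The function names, the sample data, and the choice of epsilon are all assumptions for the example; a real warehouse would also need to track the privacy budget spent across queries.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy `predicate`."""
    true_count = sum(1 for v in values if predicate(v))
    # A counting query has sensitivity 1: adding or removing one user
    # changes the result by at most 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical per-user ages; the analyst only ever sees the noisy answer.
ages = [23, 35, 41, 29, 52, 38]
noisy = private_count(ages, lambda a: a >= 30, epsilon=1.0, rng=random.Random(42))
```

Smaller values of epsilon add more noise and give stronger privacy, at the cost of less accurate answers.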
Second, give users the opportunity to download their own little database in native format (e.g. SQLite). This is the ultimate in data portability. I think Dolt [0] might be good for this, because its git-like approach gives you push/pull syncing as well as diffing. That would make it easy for users to keep a local copy of the data up to date.
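A one-shot version of that export is easy to sketch with plain SQLite. The `notes` table, its columns, and `export_user_db` are hypothetical, and this only produces a snapshot; the push/pull syncing and diffing mentioned above are what something like Dolt would add on top.

```python
import os
import sqlite3
import tempfile


def export_user_db(source, user_id, dest_path):
    """Copy one user's rows into a fresh SQLite file; return the row count."""
    rows = source.execute(
        "SELECT id, body, created_at FROM notes WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()
    dest = sqlite3.connect(dest_path)
    try:
        with dest:  # commits on success, rolls back on error
            dest.execute(
                "CREATE TABLE notes ("
                "id INTEGER PRIMARY KEY, body TEXT NOT NULL, created_at TEXT NOT NULL)"
            )
            dest.executemany("INSERT INTO notes VALUES (?, ?, ?)", rows)
    finally:
        dest.close()
    return len(rows)


# Stand-in for the service's multi-user database.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT, created_at TEXT)"
)
source.executemany(
    "INSERT INTO notes VALUES (?, ?, ?, ?)",
    [
        (1, 1, "first", "2021-01-01"),
        (2, 2, "someone else's note", "2021-01-02"),
        (3, 1, "third", "2021-01-03"),
    ],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "user_1.sqlite")
    exported = export_user_db(source, 1, path)
    copy = sqlite3.connect(path)
    bodies = [row[0] for row in copy.execute("SELECT body FROM notes ORDER BY id")]
    copy.close()
```

The exported file contains only the requesting user's rows and drops the `user_id` column, since the whole file belongs to one user.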
Third, you can start to support self-hosting and perhaps even open-source the primary user-facing application. The hosted service sells convenience and features enabled by the privacy-respecting data warehouse.
The big questions, of course, are many:
- Would users pay for this?
- Does increased development cost and reduced velocity outweigh the privacy benefits?
- Would the open-source component enable clones that undermine your business, or attract new users who may eventually upgrade to your paid service?
I would like to find out the answers!
There are React Native forks for Windows, macOS and Linux. I have no idea whether any of them is a "good implementation" though.
> SQL + realtime computed views (eg materialize)
ClickHouse (OLAP DB) has materialized views (but only for inserts). Oracle and (I guess!) Materialize should have them too.
> Desktop apps that can be run without needing to be installed. (Like websites, but with native code.)
AppImage (and maybe Snap and Flatpak) is like this. Also technically, with Nix you can just run something like
nix-shell -p chromium --command chromium
(without root), but it feels like cheating.
> Git but for data
https://github.com/dolthub/dolt (again, I haven't tried it yet, but would like to in the future)
DoltHub is a Series A startup ($7m raised) building Git for data. Our core project is an open-source database called Dolt (https://github.com/dolthub/dolt) that allows Git-like version control for databases (branches and merges). DoltHub (https://www.dolthub.com/) is built on top of the Dolt datastore and aims to be a data hub for open-source data collaboration at scale.
We are looking for mid- to senior-level devs who are passionate about scaling DoltHub and making it the de facto platform for data collaboration. Some work may include getting datasets onto Dolt via ETL/Airflow jobs or even working on the core Dolt DB technology. Our team is filled with rockstar distinguished/senior engineers from places like Amazon, Snapchat, and Google. You will learn a lot.
Stack: Go, Node.js, Next.js, React.js
See the full JD here: https://www.linkedin.com/jobs/view/2166939708/?refId=4890099...
Questions: Reply to this thread or shoot an email to [email protected]
Unfortunately, we cannot sponsor H1Bs or F-1 OPT candidates.