I have built version control for data on top of git itself that can commit and push incremental diffs. Tagging in git creates a version snapshot. S3 can be configured (a) to store heavy files and diffs referenced by pointer objects and (b) to hold shareable snapshots, identified by repo, tag name, and commit sha. The diffs operate at the row, column, and cell level rather than by block deduping, so datasets must have some tabular structure. The data goes wherever you push it, and to your own S3 bucket if configured.
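The post doesn't show the actual diff or pointer formats, so here is a minimal Python sketch of what a cell/row/column-level diff and a git-committed pointer object could look like; every class, field, and key layout below is a hypothetical illustration, not the project's real schema.

    # Hypothetical sketch only: the project's real diff and pointer formats
    # are not shown in the post.
    from dataclasses import dataclass, field
    from typing import Any


    @dataclass
    class CellEdit:
        """One changed cell, addressed by row key and column name."""
        row_key: str
        column: str
        old: Any
        new: Any


    @dataclass
    class TableDiff:
        """Row-, column-, and cell-level changes for one dataset between two commits."""
        added_rows: list[dict] = field(default_factory=list)       # each dict includes "row_key"
        deleted_row_keys: list[str] = field(default_factory=list)
        added_columns: list[str] = field(default_factory=list)
        cell_edits: list[CellEdit] = field(default_factory=list)


    @dataclass
    class PointerObject:
        """Small file committed to git; the heavy payload lives on S3."""
        s3_bucket: str
        s3_key: str        # e.g. "<repo>/diffs/<commit-sha>/<dataset>.diff"
        sha256: str        # integrity check for the S3 payload
        size_bytes: int


    def apply_diff(rows: dict[str, dict], diff: TableDiff) -> dict[str, dict]:
        """Apply a TableDiff to a dataset held as {row_key: {column: value}}."""
        out = {k: dict(v) for k, v in rows.items()}
        for key in diff.deleted_row_keys:
            out.pop(key, None)
        for row in diff.added_rows:
            out[row["row_key"]] = {c: v for c, v in row.items() if c != "row_key"}
        for edit in diff.cell_edits:
            out[edit.row_key][edit.column] = edit.new
        return out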

The burden of checking out and building snapshots from diff history is currently borne by the local machine, but that may change. Smart navigation of git history from the nearest available snapshots (sketched below), building snapshots with Spark, and other ways to save on data transfer and compute are all on the table. Merge conflict resolution is in the works. This paradigm also makes it possible to hibernate or clean up history on S3 for datasets that are no longer needed to build snapshots, e.g. datasets that have been removed from git, provided snapshots of earlier commits are not required. Individual data entries could also be removed for GDPR compliance using S3 object versioning, orthogonally to git.
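Since the nearest-snapshot navigation isn't described in detail, here is a minimal, self-contained sketch of the idea, assuming a linear history; the in-memory stores and function names are stand-ins, not the project's code.

    # Toy in-memory stores standing in for S3: full snapshots keyed by commit
    # sha, plus one diff per commit expressed as a function over the dataset.
    snapshots = {
        "a1": {"row1": {"price": 10}},
    }
    diffs = {
        "b2": lambda rows: {**rows, "row2": {"price": 20}},   # adds a row
        "c3": lambda rows: {**rows, "row1": {"price": 12}},   # edits a cell
    }
    history = ["a1", "b2", "c3"]  # oldest -> newest, assumed linear


    def checkout(target_sha: str) -> dict:
        """Rebuild the dataset at target_sha while replaying as few diffs as possible."""
        idx = history.index(target_sha)
        # Walk backwards to the closest commit that has a stored snapshot.
        base = next(i for i in range(idx, -1, -1) if history[i] in snapshots)
        rows = dict(snapshots[history[base]])
        # Replay the diff of every commit after the snapshot up to the target.
        for sha in history[base + 1: idx + 1]:
            rows = diffs[sha](rows)
        return rows


    print(checkout("c3"))  # {'row1': {'price': 12}, 'row2': {'price': 20}}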

The prototype already cures the pain point I built it for: it was previously impossible to (1) uniquely identify and (2) make available behind an API multiple versions of a collection of datasets and config parameters, (3) without overburdening HDDs due to small but frequent changes to any of the datasets in the repo, and (4) while still being able to see the diffs in git for each commit, to enable collaborative discussion, reverting, or further editing where necessary (points (1) and (2) are sketched in code below). Some background: I am building natural language AI algorithms that (+) operate on editable training datasets, meaning changes or deletions in the training data are reflected quickly, without traces of past training and without retraining the entire language model (I know this sounds impossible), and (++) explain decisions back to individual training data points. LLMs have fixed training datasets, whereas editable datasets call for a collaborative system to manage the data efficiently.
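For points (1) and (2), one way the unique identification could translate into an API is deterministic S3 keys built from repo, tag, and commit sha; the bucket name, key layout, and file format below are assumptions, not the prototype's actual scheme.

    # Assumed key layout and file format, for illustration only.
    import boto3


    def snapshot_key(repo: str, tag: str, commit_sha: str, dataset: str) -> str:
        # A snapshot is uniquely addressed by (repo, tag, commit sha, dataset).
        return f"{repo}/snapshots/{tag}/{commit_sha}/{dataset}.parquet"


    def fetch_snapshot(bucket: str, repo: str, tag: str,
                       commit_sha: str, dataset: str) -> bytes:
        # An API endpoint could do exactly this and stream the bytes back.
        s3 = boto3.client("s3")
        obj = s3.get_object(Bucket=bucket,
                            Key=snapshot_key(repo, tag, commit_sha, dataset))
        return obj["Body"].read()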

I am open to everything, including thoughts, suggestions, constructive criticism, and use case ideas.

Very cool!

Have you heard about Dolt, which is also "Git for Data"?

https://github.com/dolthub/dolt

We also built DoltHub, which is like GitHub for Dolt databases:

https://www.dolthub.com/