What does HackerNews think of dud?
A lightweight CLI tool for versioning data alongside source code and building data pipelines.
In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django. I have a longer comparison in the README[1].
[0]: https://github.com/kevin-hanselman/dud
[1]: https://github.com/kevin-hanselman/dud/blob/main/README.md#m...
[0]: https://github.com/kevin-hanselman/dud
[1]: https://github.com/kevin-hanselman/dud/tree/main/integration...
Dud is meant to be a companion to SCM (e.g. Git) for large files. I was turned off of Git LFS after a couple of failed attempts at using it for data science work. DVC[1] is an improvement in many ways, but it has some rough edges and serious performance issues[2].
With Dud I focused on speed and simplicity. To your three points above:
1) Dud can comfortably track datasets in the 100s of GBs. In practice, the bottleneck is your disk I/O speed.
2) Dud checks out binaries as links by default, so it's super fast to switch between commits.
3) Dud includes a means to build data pipelines -- think Makefiles with fewer footguns. Dud can detect when outputs are up to date and skip executing a pipeline stage.
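To make point 3 concrete, here is a minimal Python sketch of checksum-based stage skipping. It is not Dud's code or file format: the names `run_stage`, `snapshot`, and `file_checksum`, the JSON lock file, and the choice of SHA-256 are all assumptions made for illustration. The idea is only that a stage records checksums of its inputs and outputs after running, and is skipped the next time if nothing has changed.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each path (as a string) to its current checksum."""
    return {str(p): file_checksum(p) for p in paths}


def run_stage(inputs, outputs, command, lock_path):
    """Run command() unless the recorded checksums still match.

    Hypothetical helper: returns True if the stage ran, False if skipped.
    """
    lock_path = Path(lock_path)
    all_present = all(Path(p).exists() for p in [*inputs, *outputs])
    if lock_path.exists() and all_present:
        recorded = json.loads(lock_path.read_text())
        current = {"inputs": snapshot(inputs), "outputs": snapshot(outputs)}
        if recorded == current:
            return False  # up to date, nothing to do
    command()
    # Record the state the stage left behind for the next comparison.
    state = {"inputs": snapshot(inputs), "outputs": snapshot(outputs)}
    lock_path.write_text(json.dumps(state))
    return True
```

Unlike Make's timestamp comparison, this sketch compares content, so touching a file without changing it does not trigger a rebuild.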
I hope this helps, and I'd be happy to chat about it.
[0]: https://github.com/kevin-hanselman/dud
[1]: https://dvc.org
[2]: https://github.com/kevin-hanselman/dud#concrete-differences-...
When I started researching version control tools for large files, I remember feeling like git-annex and Git LFS were awkwardly bolted onto Git; Git simply wasn't designed for large files. Then I found DVC[1], and its approach rang true for me. However, after using DVC for a year or so, I grew tired of its many puzzling behaviors (most of which are outlined in the README at [0]). In the end, I built the tool I wanted for the job -- one that is exceptionally simple and fast.
[0]: https://github.com/kevin-hanselman/dud
[1]: https://dvc.org/
The core idea is to version large files/directories in a content-addressed store, and to track the content addresses (i.e. checksums) in source control (e.g. Git) for historical traceability.
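A rough Python sketch of that idea, under stated assumptions: the function names `commit` and `checkout`, the two-character cache directory layout, and the use of SHA-256 are hypothetical choices for illustration and are not taken from Dud. The sketch moves a file into a cache keyed by its checksum, and restores a version by linking the workspace path to the cached copy, which is why switching versions does not require copying data.

```python
import hashlib
import os
import shutil
from pathlib import Path


def commit(path, cache_dir):
    """Move a file into the cache under its checksum and return the checksum.

    The whole file is read into memory for brevity; a real tool would stream it.
    """
    path, cache_dir = Path(path), Path(cache_dir)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    dest = cache_dir / digest[:2] / digest[2:]
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        path.unlink()  # identical content is already cached
    else:
        shutil.move(str(path), str(dest))
    return digest


def checkout(digest, path, cache_dir):
    """Point the workspace path at the cached content via a symbolic link."""
    path = Path(path)
    src = Path(cache_dir) / digest[:2] / digest[2:]
    if path.is_symlink() or path.exists():
        path.unlink()
    os.symlink(src.resolve(), path)
```

In this scheme, the checksum returned by `commit` is the small piece of text that would be tracked in Git, while the large file itself lives only in the cache. A real tool would also need to protect cached files from edits made through the link.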
For a practical example of how Dud works, check out the Getting Started page[2].
[1]: https://github.com/kevin-hanselman/dud
[2]: https://dvc.org/
I'm happy to answer questions and take any feedback.
[1]: https://github.com/kevin-hanselman/dud
[2]: https://rclone.org/
I'll throw in a shameless plug for my tool in this area, Dud[2]. Dud is to DVC what Flask is to Django.
Are the mentioned benchmarks published somewhere?
[1]: https://dvc.org
[2]: https://github.com/kevin-hanselman/dud
My project is in its infancy (open-sourced less than a month ago), but I'm pleased with its UX thus far. There's lots to add in terms of documentation, but Dud currently uses Rclone[1] for remote syncing.
[1]: https://rclone.org/