What does HackerNews think of dud?

A lightweight CLI tool for versioning data alongside source code and building data pipelines.

Language: Go

I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).

In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django. I have a longer comparison in the README[1].

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/blob/main/README.md#m...

DVC and Dud do this, but they're meant more for data analysis projects than for building software.

https://dvc.org/

https://github.com/kevin-hanselman/dud/

You might be referring to me/Dud[0]. If you are, first off, thanks! I'd love to know more about what development progress you're hoping for. Is there a specific set of features that bars you from using Dud? As for testing, Dud has a large and growing set of unit and integration tests[1] that are run in GitHub CI. I'll never have the same resources as Iterative/DVC, but my hope is that being open source will attract collaborators. PRs are always welcome ;)

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/tree/main/integration...

I very much agree with you about DVC's feature creep. The other issue I have with it is speed. DVC has left me scratching my head at its sluggishness many times. Because of these factors, I've been working on an alternative that focuses on simplicity and speed[0]. My tool is often five to ten times faster than DVC[1]. I'd love to hear what you think.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://kevin-hanselman.github.io/dud/benchmarks/

This is by no means a perfect match for your requirements, but I'll share a CLI tool I built, called Dud[0]. At the least it may spur some ideas.

Dud is meant to be a companion to SCM (e.g. Git) for large files. I was turned off of Git LFS after a couple of failed attempts at using it for data science work. DVC[1] is an improvement in many ways, but it has some rough edges and serious performance issues[2].

With Dud I focused on speed and simplicity. To your three points above:

1) Dud can comfortably track datasets in the 100s of GBs. In practice, the bottleneck is your disk I/O speed.

2) Dud checks out binaries as links by default, so it's super fast to switch between commits.

3) Dud includes a means to build data pipelines -- think Makefiles with fewer footguns. Dud can detect when outputs are up to date and skip executing a pipeline stage.

I hope this helps, and I'd be happy to chat about it.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://dvc.org

[2]: https://github.com/kevin-hanselman/dud#concrete-differences-...

Thanks for sharing your experience. It's non-trivial and surprising behavior like this that drove me to build a custom system[0] myself.

When I started researching version control tools for large files, I remember feeling like git-annex and Git LFS were awkwardly bolted onto Git; Git simply wasn't designed for large files. Then I found DVC[1], and its approach rang true for me. However, after using DVC for a year or so, I grew tired of its many puzzling behaviors (most of which are outlined in the README at [0]). In the end, I built the tool I wanted for the job -- one that is exceptionally simple and fast.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://dvc.org/

Maybe not precisely what you want, but I built a CLI tool[1] that's like a simplified, decoupled Git LFS. It tracks large files in a content-addressed directory, and then you track the references to that store in source control. Data compression isn't a top priority for my tool; it uses immutable symlinks, not archives.

[1]: https://github.com/kevin-hanselman/dud

I agree that decoupling from Git has its benefits, and I've built a tool[1] that seems to meet some of your needs above. The idea is to save binary data in a separate content-addressed store and have Git track references to specific files in said store. If you check it out, I'd be happy to hear what you think!

[1]: https://github.com/kevin-hanselman/dud

I've been working on Dud[1] to help solve this problem. Coincidentally I released v0.2.0 today.

The core idea is to version large files/directories in a content-addressed store, and to track the content addresses (i.e. checksums) in source control (e.g. Git) for historical traceability.

For a practical example of how Dud works, check out the Getting Started page[2].

[1]: https://github.com/kevin-hanselman/dud

[2]: https://kevin-hanselman.github.io/dud/getting_started/

I built Dud[1] because I wanted a simpler, faster DVC[2]. If DVC is Django, I set out to build Flask. In using Dud myself, I think I've succeeded thus far. But the only way to know for sure is to publicize the project and get it in other people's hands. Here's to that. I'm planning on polishing the documentation and cutting a 0.1 release soon.

[1]: https://github.com/kevin-hanselman/dud

[2]: https://dvc.org/

Maybe not exactly what you're looking for, but I'm building a tool that's meant to be a companion to Git for large files[1]. The core concept is to track large files/directories in a separate content-addressed store and have Git track references to said files. To your "I can't use them without setting up another server" comment, I'm making use of rclone[2] to replicate the file store to any reputable storage service/platform. If you already use S3, for instance, just set up a new bucket.

I'm happy to answer questions and take any feedback.

[1]: https://github.com/kevin-hanselman/dud

[2]: https://rclone.org/

Very interesting. I'd like to learn more about how it works. How does this compare to DVC[1], for instance?

I'll throw in a shameless plug for my tool in this area, Dud[2]. Dud is to DVC what Flask is to Django.

Are the mentioned benchmarks published somewhere?

[1]: https://dvc.org

[2]: https://github.com/kevin-hanselman/dud

You may also be interested in a simple tool I'm building that works in concert with source control to store, version, and reproduce large data: https://github.com/kevin-hanselman/dud

My project is in its infancy (open-sourced less than a month ago), but I'm pleased with its UX thus far. There's lots to add in terms of documentation, but Dud already uses rclone[1] for remote syncing.

[1]: https://rclone.org/