What does HackerNews think of dud?

A lightweight CLI tool for versioning data alongside source code and building data pipelines.

Language: Go

I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).

In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django. I have a longer comparison in the README[1].

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/blob/main/README.md#m...

DVC and Dud do this, but they're meant more for data analysis projects than for building software.

https://dvc.org/

https://github.com/kevin-hanselman/dud/

You might be referring to me/Dud[0]. If you are, first off, thanks! I'd love to know more about what development progress you're hoping for. Is there a specific set of features that bars you from using Dud? As for testing, Dud has a large and growing set of unit and integration tests[1] that are run in GitHub CI. I'll never have the same resources as Iterative/DVC, but my hope is that being open source will attract collaborators. PRs are always welcome ;)

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/tree/main/integration...

I very much agree with you about DVC's feature creep. The other issue I have with it is speed. DVC has left me scratching my head at its sluggishness many times. Because of these factors, I've been working on an alternative that focuses on simplicity and speed[0]. My tool is often five to ten times faster than DVC[1]. I'd love to hear what you think.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://kevin-hanselman.github.io/dud/benchmarks/

This is by no means a perfect match for your requirements, but I'll share a CLI tool I built, called Dud[0]. At the least it may spur some ideas.

Dud is meant to be a companion to SCM (e.g. Git) for large files. I was turned off of Git LFS after a couple of failed attempts at using it for data science work. DVC[1] is an improvement in many ways, but it has some rough edges and serious performance issues[2].

With Dud I focused on speed and simplicity. To your three points above:

1) Dud can comfortably track datasets in the 100s of GBs. In practice, the bottleneck is your disk I/O speed.

2) Dud checks out binaries as links by default, so it's super fast to switch between commits.

3) Dud includes a means to build data pipelines -- think Makefiles with fewer footguns. Dud can detect when outputs are up to date and skip executing a pipeline stage.

I hope this helps, and I'd be happy to chat about it.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://dvc.org

[2]: https://github.com/kevin-hanselman/dud#concrete-differences-...

Thanks for sharing your experience. It's non-trivial and surprising behavior like this that drove me to build a custom system[0] myself.

When I started researching version control tools for large files, I remember feeling like git-annex and Git LFS were awkwardly bolted onto Git; Git simply wasn't designed for large files. Then I found DVC[1], and its approach rang true for me. However, after using DVC for a year or so, I grew tired of its many puzzling behaviors (most of which are outlined in the README at [0]). In the end, I built the tool I wanted for the job -- one that is exceptionally simple and fast.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://dvc.org/

Maybe not precisely what you want, but I built a CLI tool[1] that's like a simplified, decoupled Git LFS. It tracks large files in a content-addressed directory, and then you track the references to that store in source control. Data compression isn't a top priority for my tool; it uses immutable symlinks, not archives.

[1]: https://github.com/kevin-hanselman/dud

I agree that decoupling from Git has its benefits, and I've built a tool[1] that seems to meet some of your needs above. The idea is to save binary data in a separate content-addressed store and have Git track references to specific files in said store. If you check it out, I'd be happy to hear what you think!

[1]: https://github.com/kevin-hanselman/dud

I've been working on Dud[1] to help solve this problem. Coincidentally I released v0.2.0 today.

The core idea is to version large files/directories in a content-addressed store, and to track the content addresses (i.e. checksums) in source control (e.g. Git) for historical traceability.

For a practical example of how Dud works, check out the Getting Started page[2].

[1]: https://github.com/kevin-hanselman/dud

[2]: https://kevin-hanselman.github.io/dud/getting_started/

I built Dud[1] because I wanted a simpler, faster DVC[2]. If DVC is Django, I set out to build Flask. In using Dud myself, I think I've succeeded thus far. But the only way to know for sure is to publicize the project and get it in other people's hands. Here's to that. I'm planning on polishing the documentation and cutting a 0.1 release soon.

[1]: https://github.com/kevin-hanselman/dud

[2]: https://dvc.org/

Maybe not exactly what you're looking for, but I'm building a tool that's meant to be a companion to Git for large files[1]. The core concept is to track large files/directories in a separate content-addressed store and have Git track references to said files. To your "I can't use them without setting up another server" comment, I'm making use of rclone[2] to replicate the file store to any reputable storage service/platform. If you already use S3, for instance, just set up a new bucket.

I'm happy to answer questions and take any feedback.

[1]: https://github.com/kevin-hanselman/dud

[2]: https://rclone.org/

Very interesting. I'd like to learn more about how it works. How does this compare to DVC[1], for instance?

I'll throw in a shameless plug for my tool in this area, Dud[2]. Dud is to DVC what Flask is to Django.

Are the mentioned benchmarks published somewhere?

[1]: https://dvc.org

[2]: https://github.com/kevin-hanselman/dud

You may also be interested in a simple tool I'm building that works in concert with source control to store, version, and reproduce large data: https://github.com/kevin-hanselman/dud

My project is in its infancy (open-sourced less than a month ago), but I'm pleased with its UX thus far. There's lots to add in terms of documentation, but Dud already uses rclone[1] for remote syncing.

[1]: https://rclone.org/