I used to use this _extensively_ years ago. It sounds great in theory but very quickly gets in your way.

A while back I went through several of these kinds of build systems. The one I eventually settled on was Snakemake[1], with which I finally was able to put together an entire complex bioinformatics workflow[2]. I would say that Snakemake's strength is one of Perl's old mottos: "making easy things easy and hard things possible." On the easy side, writing a simple rule with fixed inputs and outputs is just as easy as Make (trading the weird whitespace rules of Make for the less weird whitespace rules of Python). On the hard side, it supports stuff like computing inputs at runtime, or using regex capture groups to determine output file names from input file names. So for example you can have your rule for LaTeX to PDF conversion read the source TeX file, determine all the external images, bib files, etc. referenced from it, and add those files as input dependencies so that when any of them changes, Snakemake will know to rebuild the PDF. See e.g. this rule[3] that builds my resume PDF from the LyX source.
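
To make the "computing inputs at runtime" idea concrete, here is a minimal sketch of the kind of scan such a rule could perform. It is not the code from the linked resume rule: the helper name `tex_dependencies` and the set of recognized commands are assumptions for illustration, and it ignores TeX comments and many other ways of including files.

```python
import re
from pathlib import Path

# Hypothetical pattern: \includegraphics[opts]{file}, \input{file},
# and \bibliography{a,b}. Real documents use many more commands.
_DEP_PATTERN = re.compile(
    r"\\(includegraphics|input|bibliography)\s*(?:\[[^\]]*\])?\s*\{([^}]*)\}"
)


def tex_dependencies(tex_source):
    """Return the external files referenced by a TeX source string."""
    deps = []
    for command, argument in _DEP_PATTERN.findall(tex_source):
        # \bibliography accepts a comma-separated list of names.
        for name in argument.split(","):
            name = name.strip()
            if not name:
                continue
            if command == "bibliography" and not name.endswith(".bib"):
                name += ".bib"  # BibTeX names are usually given without extension
            elif command == "input" and not Path(name).suffix:
                name += ".tex"  # \input defaults to a .tex extension
            deps.append(name)
    return deps


if __name__ == "__main__":
    sample = r"""
    \documentclass{article}
    \begin{document}
    \includegraphics[width=\textwidth]{figures/plot.pdf}
    \input{sections/intro}
    \bibliographystyle{plain}
    \bibliography{refs,extra}
    \end{document}
    """
    for dep in tex_dependencies(sample):
        print(dep)
```

In a Snakefile, a function like this could be called from an input function (a callable that receives the rule's wildcards and returns a list of paths), so that the discovered files become dependencies of the PDF.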

[1]: https://snakemake.readthedocs.io/en/stable/

[2]: https://github.com/DarwinAwardWinner/CD4-csaw

[3]: https://github.com/DarwinAwardWinner/resume/blob/main/Snakef...

nerdponx

I really tried my best to use Snakemake in ~2018-2019 for day-to-day data science work. I found that its DSL was a confusing and (at least at the time) under-documented hybrid of Python and custom syntax. It also suffered from some of the primary limitations of traditional Make:

• Targets/outputs must be files, without even the opportunity to extend the space of possible targets e.g. using your own hash comparison function. This is not suitable for a workflow where the output is a directory of filenames that aren't known in advance, or a database table.

• By default, can only run commands through a shell, not directly as a subprocess. At the same time, it also lacks support for nontrivial chaining and composition of commands. In your CD4 example you hack around this by literally writing your own pipelining command runner in Python; hard to say if that's any better than a big blob of shell script in a string!
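
For reference, chaining commands without a shell can be sketched with nothing but the standard library. This is neither Snakemake functionality nor the runner from the CD4 repository; `run_pipeline` is a hypothetical helper that wires each process's stdout to the next one's stdin, the way `a | b | c` would.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run a list of argv lists as a pipeline and return the final stdout.

    No shell is involved: every command is started directly from its
    argument list, so no quoting or escaping is needed.
    """
    if not commands:
        raise ValueError("need at least one command")
    procs = []
    upstream = None  # stdout of the previous process, if any
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            # Close the parent's copy so the upstream process sees a
            # broken pipe if the downstream process exits early.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    # Fail if any stage failed, not just the last one.
    for proc, argv in zip(procs, commands):
        if proc.wait() != 0:
            raise subprocess.CalledProcessError(proc.returncode, argv)
    return output


if __name__ == "__main__":
    # Two Python child processes stand in for real tools here.
    result = run_pipeline([
        [sys.executable, "-c", "print('b\\na\\nc')"],
        [sys.executable, "-c",
         "import sys; print(''.join(sorted(sys.stdin)), end='')"],
    ])
    print(result.decode())
```

Checking every stage's exit status is a deliberate choice: a plain shell pipeline reports only the last command's status unless `pipefail` is set.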

I agree that it definitely "makes hard things possible", but my experience did not give me the impression that it "makes easy things easy". It was frustratingly deficient in what I believe are obvious areas for improvement, and it shared some of the most obnoxious Make pain points that I encounter on a regular basis, even in ordinary day-to-day software development.

I ended up switching from Snakemake to DVC basically as soon as it came out, because it at least supports using directories as outputs, and has a few other nice features that specifically work well for data science project version control and artifact management/sharing for collaboration. That said, I had/have much less of a need for complicated workflow orchestration than for easy tracking & sharing of datasets and large artifacts (fitted models, processed tabular datasets), and for tracking directories as outputs (partitioned Parquet datasets).
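
As a rough illustration of what treating a directory as a single output can mean, the sketch below hashes every file under a directory and compares the result with a previously recorded digest. It is not how DVC or Snakemake implement this; `directory_digest` and `is_stale` are hypothetical names, and the files are read fully into memory for simplicity.

```python
import hashlib
from pathlib import Path


def directory_digest(root):
    """Return a SHA-256 digest over the relative paths and contents
    of all files under root, visited in sorted order."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length prefixes keep (name, content) pairs unambiguous.
        digest.update(len(rel).to_bytes(8, "big"))
        digest.update(rel)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def is_stale(root, recorded_digest):
    """True if the directory is missing or differs from the recorded digest."""
    root = Path(root)
    return not root.is_dir() or directory_digest(root) != recorded_digest


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "dataset"
        (out / "part=1").mkdir(parents=True)
        (out / "part=1" / "data.bin").write_bytes(b"abc")
        recorded = directory_digest(out)
        print(is_stale(out, recorded))  # unchanged directory
        (out / "part=2").mkdir()
        (out / "part=2" / "data.bin").write_bytes(b"def")
        print(is_stale(out, recorded))  # a new partition was added
```

Because the file names are discovered by walking the directory, they do not need to be known in advance, which is the case for partitioned datasets.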

It would be interesting to go back and re-evaluate the space of workflow DSLs and task runners, because DVC does not have the complete feature set that Snakemake has, and IMO nor should it (I am already getting worried about its feature creep). I'm also not sure if or how well Snakemake and DVC could be made to work together; the former for task running and the latter for task input/output tracking with version control. The perfect tool (or better yet, the perfect combination of tools) remains elusive.

kvnhn

I very much agree with you about DVC's feature creep. The other issue I have with it is speed. DVC has left me scratching my head at its sluggishness many times. Because of these factors, I've been working on an alternative that focuses on simplicity and speed[0]. My tool is often five to ten times faster than DVC[1]. I'd love to hear what you think.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://kevin-hanselman.github.io/dud/benchmarks/