What does HackerNews think of differential-dataflow?

An implementation of differential dataflow using timely dataflow on Rust.

Language: Rust

Lets you define queries over a data set declaratively. Instead of recomputing the query over the entire data set every time you want an updated answer, it uses Differential Dataflow <https://github.com/frankmcsherry/differential-dataflow> to efficiently(^1) compute the new results by updating the results of the previous query execution in response to changes in the data set.

^1: I'm not an expert on Differential Dataflow, so I don't know what "efficiently" means in this context, other than "should be faster than running the query from scratch."
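
To make "update the previous results instead of recomputing" concrete, here is a minimal std-only sketch. It is not the differential-dataflow API; the names `count_from_scratch` and `apply_updates` are made up for illustration, and the "query" is assumed to be a simple count of records per key. The point is only that the incremental path does work proportional to the number of changes, not to the size of the data set.

```rust
use std::collections::HashMap;

/// A change to the input: the record, and +1 for an insertion or -1 for a removal.
/// (Hypothetical type for this sketch, not taken from any library.)
type Update = (String, i64);

/// The "run the query from scratch" path: count every record in the data set.
fn count_from_scratch(records: &[String]) -> HashMap<String, i64> {
    let mut counts = HashMap::new();
    for record in records {
        *counts.entry(record.clone()).or_insert(0) += 1;
    }
    counts
}

/// The incremental path: adjust the previous result using only the changes.
/// Work done here depends on `updates.len()`, not on the size of the data set.
fn apply_updates(counts: &mut HashMap<String, i64>, updates: &[Update]) {
    for (key, diff) in updates {
        let new_count = {
            let entry = counts.entry(key.clone()).or_insert(0);
            *entry += *diff;
            *entry
        };
        // Drop keys whose count has returned to zero so that the result
        // matches what a fresh recomputation would produce.
        if new_count == 0 {
            counts.remove(key);
        }
    }
}
```

After any batch of updates, the map maintained by `apply_updates` should equal what `count_from_scratch` returns on the updated data set. As far as I understand it, the real library generalizes this idea to joins, grouping, and iteration.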

Claims to outperform Differential Dataflow[0], which underlies Materialize's[1] incremental materialized view product. Supports SQL out of the box. Definitely worth a deeper look. This is such an exciting area right now, I am envious of anybody writing streaming applications in five years' time!

0: https://github.com/frankmcsherry/differential-dataflow

1: https://materialize.io/

No idea where this fits on the priority list, but I think a lot of problems around stream processing still aren't solved, and it's holding us back from a really productive programming paradigm. Handling updates/retractions elegantly is hard or impossible on many platforms, and handling late (sometimes _extremely_ late) data can be very inefficient. Working with complex dependencies between events (beyond just time-based windows), in real time, can be really tough. As the saying goes, cache invalidation is one of the hardest problems in software engineering. Having a simple platform to represent processing as a DAG, but fully supporting both short- and long-term changes transparently, would make event sourcing architectures trivial and extremely productive. The closest we've come seems to be:

https://github.com/frankmcsherry/differential-dataflow

Lots of very active CS research in this area though.
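
A rough sketch of what "handling retractions and late data" can look like when every change carries a signed diff. This is hand-rolled, std-only Rust, not the API of differential-dataflow or any other platform; `Change`, `WindowedSum`, and the fixed-width tumbling windows are all assumptions made for illustration. A late event or a retraction touches only the window it belongs to, and the operator tells downstream consumers to retract the old total and insert the new one.

```rust
use std::collections::BTreeMap;

/// One change to the input: an event at `time` carrying `value`.
/// `diff` is +1 to add the event and -1 to retract a previously added one.
#[derive(Clone, Copy)]
struct Change {
    time: u64,
    value: i64,
    diff: i64,
}

/// One change to the output: ((window_start, total), diff).
type OutputChange = ((u64, i64), i64);

/// Per-window sums over tumbling windows of a fixed width.
struct WindowedSum {
    width: u64,
    /// window_start -> (sum of live values, number of live events)
    windows: BTreeMap<u64, (i64, i64)>,
}

impl WindowedSum {
    fn new(width: u64) -> Self {
        assert!(width > 0, "window width must be positive");
        WindowedSum { width, windows: BTreeMap::new() }
    }

    /// Apply one input change and return the resulting output changes.
    /// Only the window containing `change.time` is touched, however late it is.
    fn apply(&mut self, change: Change) -> Vec<OutputChange> {
        let window = change.time - change.time % self.width;
        let (old_sum, old_count) = self.windows.get(&window).copied().unwrap_or((0, 0));
        let new_sum = old_sum + change.value * change.diff;
        let new_count = old_count + change.diff;

        // A window has an output row only while it holds at least one live event.
        let old = if old_count > 0 { Some(old_sum) } else { None };
        let new = if new_count > 0 { Some(new_sum) } else { None };

        if new.is_some() {
            self.windows.insert(window, (new_sum, new_count));
        } else {
            self.windows.remove(&window);
        }

        let mut out = Vec::new();
        if old != new {
            if let Some(sum) = old {
                out.push(((window, sum), -1)); // retract the stale total
            }
            if let Some(sum) = new {
                out.push(((window, sum), 1)); // insert the corrected total
            }
        }
        out
    }
}
```

With a width of 10, adding an event of value 5 at time 3 and later a late event of value 2 at time 8 yields the output changes `((0, 5), -1)` and `((0, 7), +1)` for the second call. A downstream stage can consume those the same way, which is how a correction would ripple through a DAG without anything being recomputed from scratch.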

Shameless plug for something that does all of these things (perhaps not exactly as you want, but ..)

https://github.com/frankmcsherry/differential-dataflow

I don't want to speak for all of dataflow-dom, but the main differences that I see (and exploit) are that data-parallel dataflow languages isolate control flow into small independent regions, making the larger computation data-driven. This does mean things are eager rather than lazy, but it also makes things much easier to parallelize (because of the independence) and much easier to incrementalize.[0]

[0]: https://github.com/frankmcsherry/differential-dataflow
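
To illustrate the "small independent regions" point, a toy sketch in plain Rust with std threads. This is not how timely or differential dataflow are implemented; `worker_for` and `parallel_count` are hypothetical names, and hashing records to workers is an assumption about one common way to get independence. Once records are routed by key, each worker runs the same tiny piece of control flow on its own partition with no coordination, so the overall computation is driven by where the data goes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::thread;

/// Route a key to a worker by hashing it (hypothetical helper for this sketch).
fn worker_for(key: &str, workers: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % workers as u64) as usize
}

/// Count records per key using `workers` independent threads.
fn parallel_count(records: Vec<String>, workers: usize) -> HashMap<String, usize> {
    assert!(workers > 0, "need at least one worker");

    // Exchange step: all records with the same key land on the same worker.
    let mut partitions: Vec<Vec<String>> = vec![Vec::new(); workers];
    for record in records {
        let w = worker_for(&record, workers);
        partitions[w].push(record);
    }

    // Each worker runs the same small loop over its own data, sharing nothing.
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|partition| {
            thread::spawn(move || {
                let mut local: HashMap<String, usize> = HashMap::new();
                for record in partition {
                    *local.entry(record).or_insert(0) += 1;
                }
                local
            })
        })
        .collect();

    // Keys are disjoint across workers, so merging is a plain union.
    let mut result = HashMap::new();
    for handle in handles {
        result.extend(handle.join().expect("worker thread panicked"));
    }
    result
}
```

The result should be the same for any number of workers, which is the property that makes this style easy to parallelize: the per-key logic never needs to know how many workers exist.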

You can contribute to:

https://github.com/frankmcsherry/timely-dataflow

https://github.com/frankmcsherry/differential-dataflow

Or, just tell your friends. :)

Better yet, write some python / pandas / dataframes / whatever_the_cool_kids_need layer on top, and rule the next big data drama cycle.

Shameless self-promotion for big data in Rust:

https://github.com/frankmcsherry/timely-dataflow

https://github.com/frankmcsherry/differential-dataflow

Nothing much useful to contribute, except that they mostly work, are largely unsafe-free (unsafe in some sorting, and a Drain replacement), and build on 1.0 stable.