What does HackerNews think of differential-dataflow?

An implementation of differential dataflow using timely dataflow, in Rust.

Language: Rust

I've been looking for this but can't find it: how does this project compare to differential dataflow?

As a sibling commenter mentioned, it's built on timely dataflow (which is lower-level), but that already has differential dataflow[0] built on top of it by the same authors.

How do they differ?

(btw. I think dataflow is very cool as a computing model (not just timely dataflow), to the point of building OctoSQL[1] around it, so I'm really curious about the details here)

[0]: https://github.com/TimelyDataflow/differential-dataflow

[1]: https://github.com/cube2222/octosql

hi! sujay from convex here. I remember reading about your "reverse query engine" when we were getting started last year and really liking that framing of the broadcast problem.

as james mentions, we entirely re-run the javascript function whenever we detect any of its inputs change. incrementality at this layer would be very difficult, since we're dealing with a general purpose programming language. also, since we fully sandbox and determinize these javascript "queries," the majority of the cost is in accessing the database.
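the rerun-on-change pattern above can be sketched roughly like this (a toy model, not convex's actual implementation; all names are illustrative): record the set of keys a query reads while it runs, and re-run the whole function whenever a write touches any key in that set.

```python
# Toy sketch (not Convex's implementation) of "re-run the whole function
# whenever any of its inputs change": while a query runs, record every key
# it reads; a later write to any recorded key triggers a full re-run.

class ReactiveDB:
    def __init__(self):
        self.data = {}
        self.read_sets = {}   # query name -> keys it read on its last run
        self.results = {}     # query name -> cached result
        self.queries = {}     # query name -> function

    def register(self, name, fn):
        self.queries[name] = fn
        self._run(name)

    def _run(self, name):
        reads = set()

        def get(key):
            reads.add(key)            # track every key the query touches
            return self.data.get(key)

        self.results[name] = self.queries[name](get)
        self.read_sets[name] = reads

    def write(self, key, value):
        self.data[key] = value
        # Re-run every query whose recorded read set includes the written key.
        for name, reads in self.read_sets.items():
            if key in reads:
                self._run(name)

db = ReactiveDB()
db.data = {"a": 1, "b": 2, "c": 3}
db.register("sum_ab", lambda get: (get("a") or 0) + (get("b") or 0))
db.write("a", 10)   # "a" is in sum_ab's read set, so it re-runs
print(db.results["sum_ab"])
```

note there's no incrementality here at all: a single changed input re-executes the entire function, which is exactly the trade-off described above.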

eventually, I'd like to explore "reverse query execution" on the boundary between javascript and the underlying data using an approach like differential dataflow [1]. the materialize folks [2] have made a lot of progress applying it for OLAP and readyset [3] is using similar techniques for OLTP.

[1] https://github.com/TimelyDataflow/differential-dataflow

[2] https://materialize.com/

[3] https://readyset.io/

I question the premise that MapReduce really ever went away. Many migrated away from Hadoop, but in frameworks that succeeded it, MapReduce was still a core pattern. And in some cases, moving away from Hadoop wasn't ideal because later frameworks still got some things wrong. Maybe we stopped talking about MapReduce because we were focused on new patterns and challenges -- how to support many complex jobs and pipelines, more interactive and exploratory analysis, etc.

I'm curious about the difference between "continuous MapReduce" and I guess a subgraph in a "differential dataflow" (which I have read about but never really used). https://github.com/TimelyDataflow/differential-dataflow

Dataflow things like Flink (or, even better, differential dataflow [0]) are far more flexible and subsume map-reduce. This article feels like hyping up the durability of the Model T.

[0] https://github.com/TimelyDataflow/differential-dataflow

You can also inject pull queries at some intermediate location in the dataflow, either as a Dirac delta for an instant-in-time query, or as a persistent query that receives push updates for its results.

[0] makes this fairly easy to handle, say by inserting the persistent queries into one of its native tables (which don't typically get persisted, though that could have changed in the past few months), with keep-alive messages from the viewer bumping the expire-at field on the query's entry in the table. [1] is an example of how the internal dataflow of such "demand-driven push" works, though it omits the expiry handling you'd want, and doesn't detail the interaction with the outside world needed for live feeds.
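The keep-alive/expiry bookkeeping described above can be sketched as follows (illustrative names and structure only, not Materialize's actual schema or API): each persistent query carries an expire-at timestamp, keep-alives bump it, and a periodic sweep retracts queries whose deadline has passed.

```python
# Hypothetical sketch of the keep-alive pattern: each persistent query
# row carries an expire-at timestamp; keep-alive messages bump it, and
# a sweep retracts queries whose deadline has passed.

TTL = 30  # seconds a query survives without a keep-alive (illustrative)

class QueryRegistry:
    def __init__(self):
        self.queries = {}  # query id -> expire-at timestamp

    def insert(self, qid, now):
        self.queries[qid] = now + TTL

    def keep_alive(self, qid, now):
        if qid in self.queries:
            self.queries[qid] = now + TTL  # bump the expire-at field

    def sweep(self, now):
        # Retract every query whose expire-at has passed.
        expired = [q for q, t in self.queries.items() if t <= now]
        for q in expired:
            del self.queries[q]
        return expired

reg = QueryRegistry()
reg.insert("q1", now=0)
reg.insert("q2", now=0)
reg.keep_alive("q1", now=25)   # q1's deadline moves to 55
print(reg.sweep(now=40))       # q2 expired at 30; q1 survives
```

In the dataflow setting the registry would itself be a collection, so inserting or retracting a query entry is just another update flowing through the graph.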

What this doesn't emphasize is that, thanks to the timestamping and consistent nature of the underlying dataflow engine, you can see exactly when the new query shows up in your output stream, and can also e.g. bundle multiple queries together into one atomic transaction, so you don't get any tearing-style output/display artifacts.

The underlying Rust framework, Differential Dataflow[2], is even more powerful, but also far less easy to use, due to the lack of a query optimizer. It arguably makes up for this by already supporting recursive/iterative computation and multi-temporal timestamps.

[0]: https://materialize.com/a-simple-and-efficient-real-time-app...

[1]: https://materialize.com/lateral-joins-and-demand-driven-quer...

[2]: https://github.com/TimelyDataflow/differential-dataflow

> In the simplest case, I'm talking about regular SQL non-materialized views which are essentially inlined.

I see that now -- makes sense!

> Wish we had some better database primitives to assemble rather than building everything on Postgres - it's not ideal for a lot of things.

I'm curious to hear more about this! We agree that better primitives are required, and that's why Materialize is written in Rust using TimelyDataflow[1] and DifferentialDataflow[2] (both developed by Materialize co-founder Frank McSherry). The only relationship between Materialize and Postgres is wire compatibility: we don't share any code with Postgres, nor do we depend on it.

[1] https://github.com/TimelyDataflow/timely-dataflow [2] https://github.com/TimelyDataflow/differential-dataflow

What you're asking about is the magic at the heart of Materialize. We're built atop an open-source incremental compute framework called Differential Dataflow [0] that one of our co-founders has been working on for ten years or so.

The basic insight is that for many computations, when an update arrives, the amount of incremental compute that must be performed is tiny. If you're computing `SELECT count(1) FROM relation`, a new row arriving just increments the count by one. If you're computing a `WHERE` clause, you just need to check whether the update satisfies the predicate or not. Of course, things get more complicated with operators like `JOIN`, and that's where Differential Dataflow's incremental join algorithms really shine.
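The change-based model behind this can be sketched in a few lines. This is a conceptual Python sketch, not the actual Rust library: as in Differential Dataflow, updates are (record, diff) pairs, with diff +1 for an insertion and -1 for a deletion, so maintaining `count` or a `WHERE` clause costs work proportional to the batch of changes rather than to the full relation.

```python
# Conceptual sketch of incremental maintenance over (record, diff) pairs,
# where diff is +1 for an insertion and -1 for a deletion.

class Count:
    """Incrementally maintains SELECT count(1) FROM relation."""
    def __init__(self):
        self.count = 0

    def apply(self, changes):
        for _record, diff in changes:
            self.count += diff  # +1 per insert, -1 per delete

class Filter:
    """Incrementally maintains a WHERE clause: only the changes that
    satisfy the predicate flow downstream."""
    def __init__(self, pred):
        self.pred = pred

    def apply(self, changes):
        return [(r, d) for r, d in changes if self.pred(r)]

count = Count()
only_even = Filter(lambda r: r % 2 == 0)

batch = [(3, +1), (4, +1), (8, +1), (4, -1)]  # insert 3, 4, 8; delete 4
count.apply(batch)
print(count.count)              # net row count after the batch
print(only_even.apply(batch))   # only the even records survive
```

Note that neither operator ever looks at the full relation, only at the incoming batch; that is the property that makes stateless operators like these essentially free to maintain.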

It's true that there are some computations that are very expensive to maintain incrementally. For example, maintaining an ordered query like

    SELECT * FROM relation ORDER BY col
would be quite expensive, because the arrival of a new value shifts the position of every existing value that sorts after it.

Materialize can still be quite a useful tool here, though! You can use Materialize to incrementally maintain the parts of your queries that are cheap to maintain incrementally, and execute the other parts ad hoc. This is in fact how `ORDER BY` already works in Materialize. A materialized view never maintains ordering, but you can request a sort when you fetch the contents of that view by using an `ORDER BY` clause in your `SELECT` statement. For example:

    CREATE MATERIALIZED VIEW v AS SELECT complicated FROM t1, t2, ... -- incrementally maintained
    SELECT * FROM v ORDER BY col LIMIT 5                              -- order and limit computed ad hoc, but still fast
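That division of labor can be sketched like this (illustrative Python, not Materialize's implementation): the view's contents are kept up to date incrementally from (record, diff) changes, while the sort and limit are computed ad hoc at read time over the already-materialized result.

```python
# Sketch of "maintain incrementally, sort at read time": the view is a
# multiset updated from (record, diff) changes; ORDER BY ... LIMIT is
# computed on demand over the maintained contents.

from collections import Counter

class MaintainedView:
    def __init__(self):
        self.rows = Counter()  # multiset of rows currently in the view

    def apply(self, changes):
        for record, diff in changes:
            self.rows[record] += diff
            if self.rows[record] == 0:
                del self.rows[record]   # fully retracted rows disappear

    def select_ordered(self, limit):
        # Ad hoc sort at read time; cheap because the maintained
        # result is already materialized.
        return sorted(self.rows.elements())[:limit]

v = MaintainedView()
v.apply([("b", +1), ("d", +1), ("a", +1), ("c", +1)])
v.apply([("d", -1)])               # a retraction
print(v.select_ordered(limit=2))   # smallest two remaining rows
```

The read-time sort touches only the maintained result, never the base relations, which is why it stays fast even when the view itself is expensive to compute from scratch.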

[0]: https://github.com/TimelyDataflow/differential-dataflow

You might want to look at what DDflow [0] can do, in regards to handling emission/retraction.

If you figure out how this would work in the context of DDflow, let us know. This sounds very interesting.

[0]: https://github.com/TimelyDataflow/differential-dataflow