What does HackerNews think of differential-dataflow?
An implementation of differential dataflow using timely dataflow in Rust.
As a sibling commenter mentioned, it's built on timely dataflow (which is lower-level), but that already has differential dataflow[0] built on top of it by the same authors.
How do they differ?
(btw. I think dataflow is very cool as a computing model (not just timely dataflow), to the point of building OctoSQL[1] around it, so I'm really curious about the details here)
[0]: https://github.com/TimelyDataflow/differential-dataflow
as james mentions, we entirely re-run the javascript function whenever we detect any of its inputs change. incrementality at this layer would be very difficult, since we're dealing with a general purpose programming language. also, since we fully sandbox and determinize these javascript "queries," the majority of the cost is in accessing the database.
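the coarse-grained approach described above can be sketched in plain Rust (hypothetical names; this is not the actual system, just the invalidation idea): record which keys and versions the "query" read on its last run, and re-run the whole function whenever any of them has changed.

```rust
use std::collections::HashMap;

// Sketch of whole-function invalidation: deps records the (key, version)
// pairs read on the last run; on any version mismatch, re-run everything.
struct Query {
    deps: HashMap<String, u64>,
    cached: i64,
}

// The "query" itself: reads two keys from the database, recording
// the version of each value it touched.
fn run_query(db: &HashMap<String, (u64, i64)>) -> (HashMap<String, u64>, i64) {
    let mut deps = HashMap::new();
    let mut total = 0;
    for key in ["a", "b"] {
        let (version, value) = db[key];
        deps.insert(key.to_string(), version);
        total += value;
    }
    (deps, total)
}

impl Query {
    fn result(&mut self, db: &HashMap<String, (u64, i64)>) -> i64 {
        // dirty if any previously-read key now has a different version
        let dirty = self
            .deps
            .iter()
            .any(|(k, v)| db.get(k).map(|(ver, _)| ver) != Some(v));
        if dirty || self.deps.is_empty() {
            let (deps, value) = run_query(db);
            self.deps = deps;
            self.cached = value;
        }
        self.cached
    }
}
```

there's no incrementality here by design: the trade-off is simplicity against re-running work, which is tolerable when most of the cost is database access anyway.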
eventually, I'd like to explore "reverse query execution" on the boundary between javascript and the underlying data using an approach like differential dataflow [1]. the materialize folks [2] have made a lot of progress applying it for OLAP and readyset [3] is using similar techniques for OLTP.
I'm curious about the difference between "continuous MapReduce" and I guess a subgraph in a "differential dataflow" (which I have read about but never really used). https://github.com/TimelyDataflow/differential-dataflow
[0] makes this fairly easy to handle: for example, you insert the persistent queries into one of its native tables (which don't typically get persisted, though that could have changed in the past few months), and keep-alive messages from the viewer bump the expire-at field on the query's entry in that table. [1] is an example of how exactly the internal dataflow of such "demand-driven push" works, though it omits the expiry handling you'd want, and doesn't detail the interaction with the outside needed for live feeds.
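The keep-alive scheme above can be sketched in plain Rust (all names here are hypothetical stand-ins, not an API from [0] or [1]): active queries live in a table keyed by query id, each keep-alive bumps that query's expire-at time, and a periodic sweep drops entries whose viewers have gone away.

```rust
use std::collections::HashMap;

// Sketch of expire-at bookkeeping for "demand-driven" queries.
struct ActiveQueries {
    expire_at: HashMap<u64, u64>, // query id -> expiry (logical time)
    ttl: u64,                     // how long a keep-alive keeps a query live
}

impl ActiveQueries {
    // A viewer's keep-alive message bumps the query's expire-at field.
    fn keep_alive(&mut self, query_id: u64, now: u64) {
        self.expire_at.insert(query_id, now + self.ttl);
    }

    // Periodic sweep: remove and return queries nobody is watching anymore.
    fn expire(&mut self, now: u64) -> Vec<u64> {
        let dead: Vec<u64> = self
            .expire_at
            .iter()
            .filter(|(_, &t)| t <= now)
            .map(|(&id, _)| id)
            .collect();
        for id in &dead {
            self.expire_at.remove(id);
        }
        dead
    }
}
```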
What this doesn't emphasize is that, thanks to the timestamping and consistent nature of the underlying dataflow engine, you can see exactly when the new query's results appear in your output stream, and you can also e.g. bundle multiple queries together into one atomic transaction, so you don't get any tearing-style display artifacts.
The underlying Rust framework Differential Dataflow[2] is even more powerful, but also far less easy to use, due to the lack of a query optimizer. It arguably makes up for this by already supporting recursive/iterative computation and multi-temporal timestamps.
[0]: https://materialize.com/a-simple-and-efficient-real-time-app...
[1]: https://materialize.com/lateral-joins-and-demand-driven-quer...
[2]: https://github.com/TimelyDataflow/differential-dataflow
I see that now -- makes sense!
> Wish we had some better database primitives to assemble rather than building everything on Postgres - its not ideal for a lot of things.
I'm curious to hear more about this! We agree that better primitives are required, and that's why Materialize is written in Rust using TimelyDataflow[1] and DifferentialDataflow[2] (both developed by Materialize co-founder Frank McSherry). The only relationship between Materialize and Postgres is wire compatibility: we share no code with Postgres and have no dependency on it.
[1] https://github.com/TimelyDataflow/timely-dataflow [2] https://github.com/TimelyDataflow/differential-dataflow
The basic insight is that for many computations, when an update arrives, the amount of incremental compute that must be performed is tiny. If you're computing `SELECT count(1) FROM relation`, a new row arriving just increments the count by one. If you're computing a `WHERE` clause, you just need to check whether the update satisfies the predicate or not. Of course, things get more complicated with operators like `JOIN`, and that's where Differential Dataflow's incremental join algorithms really shine.
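The `count` and `WHERE` cases above can be made concrete with a minimal sketch in plain Rust (this is not the differential-dataflow API; the `CountView`/`FilterView` names are made up for illustration). Updates arrive as `(row, diff)` pairs, with `diff = +1` for an insert and `-1` for a delete, and each update needs only O(1) incremental work (plus the delete's linear scan in this naive version):

```rust
// SELECT count(1) FROM relation: a new update just adjusts the count.
struct CountView {
    count: i64,
}

impl CountView {
    fn update(&mut self, _row: i64, diff: i64) {
        self.count += diff; // +1 per insert, -1 per delete
    }
}

// WHERE clause: check only the arriving update against the predicate.
struct FilterView {
    rows: Vec<i64>,
}

impl FilterView {
    fn update(&mut self, row: i64, diff: i64, pred: impl Fn(i64) -> bool) {
        if !pred(row) {
            return; // update doesn't satisfy the predicate: nothing to do
        }
        if diff > 0 {
            self.rows.push(row);
        } else if let Some(pos) = self.rows.iter().position(|&r| r == row) {
            self.rows.remove(pos);
        }
    }
}
```

Joins are where this gets genuinely hard, because an update on one input must be matched against the maintained state of the other; that indexed-state machinery is what Differential Dataflow provides.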
It's true that there are some computations that are very expensive to maintain incrementally. For example, maintaining an ordered query like
SELECT * FROM relation ORDER BY col
would be quite expensive, because the arrival of a new value changes the position of every value that sorts greater than it. Materialize can still be quite a useful tool here, though! You can use Materialize to incrementally maintain the parts of your queries that are cheap to maintain, and execute the other parts ad hoc. This is in fact how `ORDER BY` already works in Materialize: a materialized view never maintains ordering, but you can request a sort when you fetch the contents of the view by using an `ORDER BY` clause in your `SELECT` statement. For example:
CREATE MATERIALIZED VIEW v AS SELECT complicated FROM t1, t2, ... -- incrementally maintained
SELECT * FROM v ORDER BY col LIMIT 5 -- order and limit computed ad hoc, but still fast
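The split can be sketched in plain Rust (again a hypothetical illustration, not Materialize's implementation): the view's contents are maintained incrementally as an unsorted collection, and the `ORDER BY ... LIMIT` is computed ad hoc at read time, which is cheap when the limit is small.

```rust
// A view whose contents are incrementally maintained but never kept sorted.
struct View {
    rows: Vec<i64>,
}

impl View {
    // Incremental maintenance: apply an update as a (row, diff) pair.
    fn update(&mut self, row: i64, diff: i64) {
        if diff > 0 {
            self.rows.push(row);
        } else if let Some(p) = self.rows.iter().position(|&r| r == row) {
            self.rows.remove(p);
        }
    }

    // SELECT * FROM v ORDER BY col LIMIT n, evaluated at read time.
    fn top_n(&self, n: usize) -> Vec<i64> {
        let mut sorted = self.rows.clone();
        sorted.sort();
        sorted.truncate(n);
        sorted
    }
}
```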
If you figure out how this would work in the context of DDflow, let us know. This sounds very interesting.
[0]: https://github.com/TimelyDataflow/differential-dataflow