Materialize has tackled the hardest problem in data warehousing, materialized views, which has never really been solved, and built a solution on a completely new architecture. This solution is useful by itself, but I'm also eagerly watching how their roadmap [1] plays out as they go back and build out features like persistence and start to look more like a full-fledged data warehouse, but one with the first correct implementation of materialized views.

[1] https://materialize.com/blog-roadmap/

What exactly are "materialized views"?
Suppose you have normalised your data schema to at least 3NF, perhaps even further to BCNF (as Codd intended), 4NF or 5NF.

Great! You are now largely protected against introducing many kinds of anomaly at insertion time. And you'll often need to write each datum only once (modulo implementation details like write amplification), because a normalised schema has "a place for everything and everything in its place".

Now comes time to query the data. You write some joins, and all is well. But a few things start to happen. One is that writing the same joins over and over becomes laborious. What you'd really like are some denormalised intermediate views, which transform the fully-normalised base schema into something that's more convenient to query. You can also use these views as an isolation layer between the base schema and any consumers, which will make future schema changes easier and possibly improve security.
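To make that concrete, here's a minimal sketch using Python's built-in sqlite3 module (nothing Materialize-specific). The customer/product/sale tables and the sale_detail view are invented for illustration; the point is that consumers query the view and never touch the base tables directly:

    import sqlite3

    # Hypothetical normalised base schema: each fact lives in exactly one place.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT, unit_price REAL);
        CREATE TABLE sale     (sale_id     INTEGER PRIMARY KEY,
                               customer_id INTEGER REFERENCES customer,
                               product_id  INTEGER REFERENCES product,
                               quantity    INTEGER,
                               sold_at     TEXT);

        -- A denormalised intermediate view: an isolation layer over the base schema.
        CREATE VIEW sale_detail AS
        SELECT s.sale_id,
               c.name                    AS customer_name,
               p.name                    AS product_name,
               s.quantity * p.unit_price AS line_total,
               s.sold_at
        FROM sale s
        JOIN customer c USING (customer_id)
        JOIN product  p USING (product_id);
    """)

    conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
    conn.execute("INSERT INTO product  VALUES (1, 'Widget', 9.99)")
    conn.execute("INSERT INTO sale     VALUES (1, 1, 1, 3, '2021-01-15')")
    print(conn.execute("SELECT * FROM sale_detail").fetchall())

Consumers that only know about sale_detail keep working even if you later split or rename the base tables, as long as the view is kept up to date.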

The logical endpoint of doing so is the Data Warehouse (particularly in the Kimball/star schema/dimensional modelling style). You project your normalised data, which you have high confidence in, into a completely different shape that is optimised for fast summarisation and exploration. You use this as a read-only database, because it massively duplicates information that could otherwise have been derived via query (for example, instead of a single "date" field, you have fields for day of week, day of month, day of year, week of year, whether it's a holiday ... I've built tables which include columns like "days until major conference X" and "days since last quarterly release").
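For example, a date dimension in that style can be generated up front. This is just a sketch in Python, and the holiday, conference and release dates are hypothetical placeholders:

    from datetime import date, timedelta

    # Hypothetical reference dates used for the derived columns.
    LAST_QUARTERLY_RELEASE = date(2021, 1, 4)
    MAJOR_CONFERENCE_X = date(2021, 6, 1)
    HOLIDAYS = {date(2021, 1, 1), date(2021, 12, 25)}

    def date_dimension_row(d: date) -> dict:
        """Precompute every attribute of a calendar date so queries never have to."""
        return {
            "date": d.isoformat(),
            "day_of_week": d.isoweekday(),            # 1 = Monday .. 7 = Sunday
            "day_of_month": d.day,
            "day_of_year": d.timetuple().tm_yday,
            "week_of_year": d.isocalendar()[1],
            "is_holiday": d in HOLIDAYS,
            "days_until_major_conference_x": (MAJOR_CONFERENCE_X - d).days,
            "days_since_last_quarterly_release": (d - LAST_QUARTERLY_RELEASE).days,
        }

    # One row per calendar day -- wildly redundant, and that's the point.
    dim_date = [date_dimension_row(date(2021, 1, 1) + timedelta(days=i)) for i in range(365)]
    print(dim_date[14])

Everything in that row could be derived at query time; storing it explicitly is what buys the fast summarisation and exploration.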

Now we reach the first problem: it's too slow! Projecting that data from the normalised schema requires a lot of storage and compute. After some head-scratching, you realise that your goal all along was to pay that cost upfront so that you can reap the benefits at query time. What you want is a view that has the physical characteristics of a table: you write out the results of the query, but still treat it like a view. You've "materialized" the view.
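SQLite has no native materialized views, but you can fake one by hand, which is enough to show the idea (the table and view names here are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sale (product TEXT, quantity INTEGER);
        INSERT INTO sale VALUES ('widget', 3), ('widget', 2), ('gadget', 1);

        -- A plain view: the aggregation is re-run on every read.
        CREATE VIEW sales_by_product AS
        SELECT product, SUM(quantity) AS total FROM sale GROUP BY product;

        -- "Materializing" it by hand: pay the query cost once, store the
        -- result, then read it like an ordinary table.
        CREATE TABLE sales_by_product_mat AS SELECT * FROM sales_by_product;
    """)
    print(conn.execute("SELECT * FROM sales_by_product_mat").fetchall())

The catch, of course, is that sales_by_product_mat silently goes stale the moment sale changes, which brings us to the next problem.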

Now the second problem. Who, or what, does that projection? Right now that role is filled by ETL, "Extract, Transform and Load". Extract from the normalised system, transform it into the denormalised version, then load that into a data warehouse. Most places do this on a regular cadence, such as nightly, because it just takes buckets and buckets of work to regenerate the output every time.
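In pseudocode-ish Python (the tables and shapes here are invented), a nightly batch job looks something like this -- note that the cost is proportional to the total size of the source data on every run, no matter how little actually changed:

    # Hypothetical in-memory "source system" standing in for the OLTP database.
    PRODUCTS = {1: {"name": "widget"}, 2: {"name": "gadget"}}
    SALES = [
        {"product_id": 1, "quantity": 3},
        {"product_id": 1, "quantity": 2},
        {"product_id": 2, "quantity": 1},
    ]

    def nightly_etl() -> dict:
        """Extract everything, transform (denormalise/aggregate), load by overwriting."""
        by_product = {}
        for sale in SALES:                                  # Extract: read it all
            name = PRODUCTS[sale["product_id"]]["name"]     # Transform: denormalise
            by_product[name] = by_product.get(name, 0) + sale["quantity"]
        return by_product                                   # Load: replace the warehouse table

    warehouse_table = nightly_etl()
    print(warehouse_table)   # {'widget': 5, 'gadget': 1}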

Now enter Materialize, who have a secret weapon: timely dataflow. The basic outcome is that instead of re-running an entire view query to regenerate the materialized view, they can, from a given changed datum, determine exactly what will change in the materialized view and update only that. That makes such views potentially thousands of times cheaper to maintain. You could even run the normalised schema and the denormalised projections on the same physical set of data -- no need for the overhead and complexity of ETL, no need to run two database systems, no need to wait (and without the added complexity of a full streaming platform).
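A drastically simplified sketch of that idea, in Python rather than Materialize's actual timely/differential dataflow machinery: keep the view as state and fold each change (a delta) into it, instead of recomputing from scratch. The real system handles much more (multi-way joins, consistent timestamps, and so on); this toy only maintains one grouped sum:

    from collections import defaultdict

    PRODUCTS = {1: "widget", 2: "gadget"}
    sales_by_product = defaultdict(int)     # the "materialized view", kept as state

    def apply_delta(product_id: int, quantity_delta: int) -> None:
        """A sale was inserted (+qty) or deleted (-qty): touch only the affected output row."""
        sales_by_product[PRODUCTS[product_id]] += quantity_delta

    apply_delta(1, 3)    # INSERT a sale of 3 widgets
    apply_delta(1, 2)    # INSERT a sale of 2 widgets
    apply_delta(2, 1)    # INSERT a sale of 1 gadget
    apply_delta(1, -2)   # DELETE the 2-widget sale
    print(dict(sales_by_product))   # {'widget': 3, 'gadget': 1}

Compare with the nightly ETL sketch above: here the work per change is proportional to the size of the change, not the size of the data.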

That's a great description! Does Materialize describe how they implement timely dataflow?

At my current company, we have built some systems like this, where a downstream table is essentially a function of a dozen upstream tables.

Whenever one of the upstream tables changes, its primary key is published to a queue. Some worker translates this upstream primary key into a set of downstream primary keys and publishes these downstream primary keys to a compacted queue.

The compacted queue is read by another worker, which "recomputes" each dirty key one at a time; this involves fetching the latest-and-greatest version of each upstream table.

This last worker is the bottleneck, but it's optimized by per-key caching, so we only fetch the latest-and-greatest version once per update. It can also be safely and arbitrarily parallelized, since the stream these workers read from is partitioned on key.
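Here's a rough, single-process sketch of that pipeline. All names and shapes are invented; in production the queues would be something like log-compacted Kafka topics and the two workers separate services:

    from collections import OrderedDict

    class CompactedQueue:
        """Keeps only one pending entry per key, like a log-compacted topic."""
        def __init__(self) -> None:
            self._dirty = OrderedDict()

        def publish(self, key) -> None:
            self._dirty.pop(key, None)          # re-publishing a key collapses duplicates
            self._dirty[key] = None

        def drain(self):
            while self._dirty:
                key, _ = self._dirty.popitem(last=False)
                yield key

    def to_downstream_keys(upstream_table, upstream_pk):
        """Worker 1: translate a changed upstream row into the downstream rows it affects."""
        return {upstream_pk}                    # stand-in for the real fan-out logic

    def fetch_latest_upstream(downstream_pk, cache):
        """Fetch the latest upstream state for one downstream key, memoised per batch."""
        if downstream_pk not in cache:
            cache[downstream_pk] = {"upstream_state_for": downstream_pk}   # stand-in for real reads
        return cache[downstream_pk]

    def recompute_worker(queue, downstream_table):
        """Worker 2 (the bottleneck): recompute each dirty downstream row, one at a time."""
        cache = {}
        for downstream_pk in queue.drain():
            downstream_table[downstream_pk] = fetch_latest_upstream(downstream_pk, cache)

    queue, downstream = CompactedQueue(), {}
    for change in [("orders", 42), ("orders", 42)]:          # duplicate change gets compacted away
        for pk in to_downstream_keys(*change):
            queue.publish(pk)
    recompute_worker(queue, downstream)
    print(downstream)                                        # {42: {'upstream_state_for': 42}}

Partitioning the compacted queue by key is what makes it safe to run many copies of the recompute worker in parallel: two workers never race on the same downstream row.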

> Does Materialize describe how they implement timely dataflow?

It's open source (https://github.com/TimelyDataflow/timely-dataflow), and it's also extensively written about, both in academic research papers and in the project's own documentation. The GitHub repo has pointers to all of that. See also differential dataflow (https://github.com/timelydataflow/differential-dataflow).