Researcher/author of a tool [0,1] also attempting to tackle this problem here.

Unfortunately, zero-downtime schema changes are even more complex than suggested here. Although the expand-contract method described in the post is a good approach to this problem, the mere act of altering a database table that is in active use is dangerous in itself. I've found that even trivial operations, such as adding a new column to an existing table, can take a full table lock and block clients from reading that table for the duration of the schema operation [2].
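
To make the locking hazard concrete, here's a toy demonstration using SQLite from the Python standard library (SQLite stands in for a production DBMS; exact lock granularity and blocking behavior vary per system): while one client holds a lock on the table, concurrent DDL cannot proceed.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Client A: a normal application client, managing transactions manually.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
writer.execute("INSERT INTO users (name) VALUES ('alice')")

# Client A starts a transaction touching the table and keeps it open.
writer.execute("BEGIN EXCLUSIVE")
writer.execute("UPDATE users SET name = 'bob' WHERE id = 1")

# Client B now attempts the "trivial" schema change; with a zero busy
# timeout it errors immediately instead of blocking behind client A.
migrator = sqlite3.connect(path, timeout=0)
try:
    migrator.execute("ALTER TABLE users ADD COLUMN email TEXT")
    blocked = False
except sqlite3.OperationalError:
    blocked = True

writer.execute("ROLLBACK")
print(blocked)  # the DDL could not run while the table was in use
```

In production databases the direction is usually the painful one: the long-running DDL takes the lock and it's the application's reads/writes that queue up behind it.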

In many cases it's safer to create a new table, copy the data over from the old table, and switch clients over to the new one. However, this introduces a whole new set of problems: keeping data in sync between the two tables, "fixing" foreign key constraints, etc.

If there are others researching/building tooling for this problem, I'd love to hear from you.

[0] http://github.com/quantumdb/quantumdb

[1] https://speakerdeck.com/michaeldejong/icse-17-zero-downtime-...

[2] http://blog.minicom.nl/blog/2015/04/03/revisiting-profiling-...

Make the changes on a non-active replica database (assuming your database supports replication), then promote the replica to be primary. If you don’t have replicas, the concept of “zero downtime” is really at risk as your primary database becomes a single point of failure.

That's not great either. Moving the "master" role from one server to a replica isn't instantaneous (so you get a period of read-only availability), and while the replica is performing the schema operation you still need to keep its data in sync with changes being made on the master (which, to the best of my knowledge, is non-trivial).

There are tools that "replay" data changes though:

- FB's OSC (https://www.facebook.com/notes/mysql-at-facebook/online-sche...)
- GH's gh-ost (https://github.com/github/gh-ost)
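
The trigger-based flavor of that replay idea (roughly what OSC does; gh-ost tails the binlog instead) can be sketched like this, again with SQLite as a stand-in and only the INSERT case handled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

# A changelog table plus a trigger that records every insert on the old
# table (a real tool also needs UPDATE and DELETE triggers).
conn.executescript("""
    CREATE TABLE changelog (op TEXT, id INTEGER, name TEXT);
    CREATE TRIGGER users_ins AFTER INSERT ON users
    BEGIN
        INSERT INTO changelog VALUES ('INSERT', NEW.id, NEW.name);
    END;
""")

# Initial backfill; then a concurrent write lands on the old table...
conn.execute("INSERT INTO users_new SELECT id, name FROM users")
conn.execute("INSERT INTO users (name) VALUES ('bob')")

# ...and is replayed against the new table from the changelog.
for op, id_, name in conn.execute("SELECT op, id, name FROM changelog"):
    if op == "INSERT":
        conn.execute(
            "INSERT INTO users_new (id, name) VALUES (?, ?)", (id_, name)
        )

rows = conn.execute("SELECT id, name FROM users_new ORDER BY id").fetchall()
print(rows)  # [(1, 'alice'), (2, 'bob')]
```

The trigger approach adds write overhead to the old table while the migration runs, which is one reason gh-ost moved to reading the replication log instead.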