It's unbelievable how hard distributed systems are to grasp. I recently implemented Paxos in Rust, and at certain points I literally thought I was losing my mind.

When you read "Paxos Made Simple" it really all seems so, well, simple. But then you get inconsistent commits, look at the traces of what happened, and just go "How?!"

One of the things that surprised me about this analysis was just how many bugs we found that had to do with the actual Raft implementation. Usually when I test Raft-based systems the bugs are at the edges--like the coupling of the system to the Raft library, treating it like an externally-queryable log rather than the driver of a state machine, and so on. We found integration bugs here too, but also a fair number of issues in the Raft library itself--and this is despite Redis-Raft having existing integration tests!

This stuff is hard!

Has anyone approached Jepsen about running an analysis on the Erlang Ra implementation (https://github.com/rabbitmq/ra)? I believe the Ra maintainers have been running Jepsen tests internally; I'm just curious whether they're thinking about getting an official analysis at some point. Thanks for all that you folks do!