I still remember Craig Silverstein being asked what his biggest mistake at Google was and him answering "Not pushing for ECC memory."

Google's initial strategy (c. 2000) around this was to save a few bucks on hardware, get non-ECC memory, and then compensate for it in software. It turns out this is a terrible idea, because if you can't count on memory being robust against cosmic rays, you also can't count on the software being stored in that memory being robust against cosmic rays. And when you have thousands of machines with petabytes of RAM, those bitflips do happen. Google wasted many man-years tracking down corrupted GFS files and index shards before they finally bit the bullet and just paid for ECC.

ECC memory can't eliminate these failures entirely; they can still happen. Making software resilient against bitflips in memory seems very difficult, though, since a flip can affect not only data but also code. So in theory the behavior of software under random bit flips is, well... random. You would probably have to run the same calculation on multiple computers and take the answer the quorum agrees on. I could imagine that doing so would still have been cheaper than using ECC RAM, at least around 2000.
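As a rough illustration (in Python, with made-up replica answers standing in for results from independent machines), the voting step itself is just majority counting:

    from collections import Counter

    def quorum_result(results, quorum=2):
        """Return the answer that at least `quorum` replicas agree on."""
        value, count = Counter(results).most_common(1)[0]
        if count < quorum:
            raise RuntimeError("replicas disagree: no quorum reached")
        return value

    # Hypothetical replica answers: the same computation run on three machines,
    # one of which suffered a bitflip somewhere along the way.
    replica_answers = [42, 42, 41]
    print(quorum_result(replica_answers))  # prints 42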

Generally this goes against software engineering principles. You don't try to eliminate the chance of failure and hope for the best; you induce these failures constantly (within reasonable bounds) and make sure your software can handle them. Using ECC RAM is the opposite approach: you make the errors so unlikely that you generally won't encounter them anymore, even at scale. But they can still happen, and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and sweep it under the rug.
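As a toy sketch of injecting that class of failure on purpose (the buffer and harness here are made up for illustration; the point is just to corrupt a copy of your state and confirm your own checks notice):

    import random

    def flip_random_bit(buf: bytearray) -> int:
        """Flip a single random bit in `buf` in place; return the bit index."""
        bit = random.randrange(len(buf) * 8)
        buf[bit // 8] ^= 1 << (bit % 8)
        return bit

    # Hypothetical harness: corrupt a copy of some in-memory state and check
    # that whatever validation/recovery path you have actually notices.
    state = bytearray(b"checksummed application state")
    corrupted = bytearray(state)
    bit = flip_random_bit(corrupted)
    assert corrupted != state, "flipping bit %d should have changed the buffer" % bit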

Another interesting side effect of quorum is that it also makes certain attacks more difficult to pull off, since an attacker now has to get a quorum of machines to return the same "wrong" answer for the attack to work.

There was an interesting challenge at DEF CON CTF a while back that tested this, actually. It turns out that it is possible to write x86 code that is 1-bit-flip tolerant: a bit flip anywhere in its code can be detected and recovered from, producing the same output. Of course, finding the sequence took (or so I hear) something like 3600 cores running for a day ;)
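For a concrete sense of what "1-bit-flip tolerant" means, here is a rough sketch of the exhaustive check, assuming a hypothetical run_x86(code) helper that executes the bytes (e.g. in an emulator) and returns their output:

    def verify_bitflip_tolerant(code: bytes, run_x86) -> bool:
        """True if flipping any single bit of `code` still yields the original output."""
        expected = run_x86(code)
        for bit in range(len(code) * 8):
            mutated = bytearray(code)
            mutated[bit // 8] ^= 1 << (bit % 8)
            if run_x86(bytes(mutated)) != expected:
                return False
        return True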

Nit: not for a day, more like 8 hours, and that's because we were lazy and somebody said he "just happened" to have a cluster with unbalanced resources (mainly used for deep learning, but with all GPUs occupied and quite a lot of CPU and RAM left), so we decided to brute force the last 16 bits :)

Also, the challenge host left useful state (which bit was flipped) in registers before running teams' code; without this, I'm not sure it would even be possible.

This sounds really cool and interesting.

Was any code dumped anywhere?

I found this, which corroborates everything you're saying but provides no further details: https://www.cspensky.info/slides/defcon_27_shortman.pdf

Oh, hey, it's Chad's slides!

Coverage of the finals is usually much less detailed, unfortunately, since the number of teams is much smaller and the challenges don't necessarily go up. However, https://oooverflow.io/dc-ctf-2020-quals/ links to a couple more writeups: https://dttw.tech/posts/SJ40_7MNS#proof-by-exhaustion from PPP and http://www.secmem.org/blog/2019/08/19/Shellcoding-and-Bitfli... from SeoulPlusBadass.

I see. Thanks very much for this info.

Binary bitflip resilience is really cool. The radiation-hardened-quine idea (https://codegolf.stackexchange.com/questions/57257/radiation..., https://github.com/mame/radiation-hardened-quine) is neat too, but those source-based approaches depend on a perfectly functioning and rather large binary stack (a Ruby interpreter, V8, a whole browser). A bitflip-protected hex monitor or kernel, on the other hand...