Similarly, I often come across people who are otherwise well-informed but still surprised that compilers will combine two observable operations into one, and who complain that some other thread somehow 'should' be able to observe the intermediate state. But I don't understand how they would tell the difference between an interleaving that merely never happens to occur and one that can never occur.
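
As a concrete illustration (a minimal sketch in Rust; the same reasoning applies to C/C++ atomics), the memory model allows a compiler to merge two adjacent relaxed stores, because a reader that never happens to run between them is indistinguishable from a reader that can never run between them:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Two back-to-back relaxed stores. A compiler may legally collapse them
// into a single `x.store(2, ...)`: the programmer was never guaranteed an
// execution in which another thread observes the intermediate 1.
pub fn publish(x: &AtomicU32) {
    x.store(1, Ordering::Relaxed);
    x.store(2, Ordering::Relaxed);
}

pub fn observe(x: &AtomicU32) -> u32 {
    // Before the merge this load could return 0, 1, or 2; after the merge
    // only 0 or 2 -- but no test can distinguish "1 never showed up" from
    // "1 can never show up".
    x.load(Ordering::Relaxed)
}
```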

It can be tricky, though: fuzzing your program to discover race conditions may show it behaving perfectly on one compilation target where these atomics are detected and reified, and then you later discover that on another platform your app surprisingly crashes 20% of the time.

Ideally we would test our applications on all hardware, with every permutation of configuration, and so on. In practice we rely on our compilers translating our intent accurately, and sometimes such edge cases matter.

Compatibility is a tricky thing. It's kind of like the argument over whether adding optional arguments to a function breaks backwards compatibility. It doesn't break BC if callers never pass those parameters. But if for some reason a caller was already passing extra parameters and expecting them to be ignored (for example because the function is registered as a handler somewhere else), then adding optional parameters WILL break BC and leave your software's behavior effectively undefined.

It is not about translating intent accurately. What you need is for compilation to be fully specified. Anywhere there are multiple correct ways to compile something introduces a risk of bugs that only show up with some compilers.

If you are a compiler or library writer, one solution is to avoid exposing useful properties that are not part of the spec. For instance, Go does not guarantee any particular iteration order for hashmaps, so the runtime goes out of its way to randomize the iteration order, thereby preventing developers from writing code that depends on a deterministic order.
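
Rust's standard library takes a similar line: its `HashMap` hasher is randomly seeded (for HashDoS resistance), so iteration order typically differs from one run of the program to the next, with the same effect of keeping anyone from relying on it. A quick way to see it:

```rust
use std::collections::HashMap;

fn main() {
    // The default hasher is seeded with system randomness, so the print
    // order below usually changes between runs -- there is no stable
    // order to accidentally depend on.
    let map: HashMap<String, i32> = (0..8).map(|i| (format!("key{i}"), i)).collect();
    for (key, value) in &map {
        println!("{key} => {value}");
    }
}
```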

In the case of threading, what you would need to do is essentially have a compiler/runtime that goes out of its way to order and time operations in a random/adversarial manner.

I've seen research that looks into doing this in a VM environment, which would be inhibited by the kind of compiler optimizations being discussed, and other work that modifies the compiler itself to replace the concurrency primitives with runtime functions that can then execute them in a fuzzed order.
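
A crude sketch of the "replace the primitives" idea (hypothetical names, not any particular tool's API): route shared-memory accesses through a shim that may hand the CPU to another thread, so a stress test explores interleavings that native timing would rarely produce. A real tool would use a seeded, replayable schedule rather than an ad-hoc yield.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::thread;

// Cheap splitmix-style mixing of a shared counter stands in for a proper
// RNG; roughly one access in four yields to the scheduler.
static SCHED_STATE: AtomicU64 = AtomicU64::new(0x9E37_79B9_7F4A_7C15);

fn maybe_preempt() {
    let mut x = SCHED_STATE.fetch_add(0x9E37_79B9_7F4A_7C15, Ordering::Relaxed);
    x ^= x >> 30;
    x = x.wrapping_mul(0xBF58_476D_1CE4_E5B9);
    if (x & 3) == 0 {
        thread::yield_now();
    }
}

/// Instrumented counter: every operation is bracketed by possible
/// preemption points, widening the set of interleavings a test can hit.
pub struct FuzzedCounter(AtomicUsize);

impl FuzzedCounter {
    pub fn new() -> Self {
        FuzzedCounter(AtomicUsize::new(0))
    }

    pub fn increment(&self) {
        maybe_preempt();
        self.0.fetch_add(1, Ordering::SeqCst);
        maybe_preempt();
    }

    pub fn load(&self) -> usize {
        maybe_preempt();
        let v = self.0.load(Ordering::SeqCst);
        maybe_preempt();
        v
    }
}
```

A test would then spawn a handful of threads hammering `FuzzedCounter` (or whatever shared structure is actually under test) and assert its invariants after joining.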

Ultimately, fuzzing and testing can only give you confidence that what is being tested is mostly correct. They can never give you confidence that what is written is entirely correct. If you want confidence in the latter, you need to invest in some form of static analysis (which could either be built into the language, such as a type system, or be an external analysis tool). Realistically, writing even a moderately complicated program (by modern standards) with full confidence in its correctness would involve advancing the state of the art of the field by decades (if not longer).

For the most part, the field just doesn't care about programs being fully correct, and accepts it as a fact of life that moving onto new/untested platforms and configurations will introduce/expose bugs.

> In the case of threading, what you would need to do is essentially have a compiler/runtime that goes out of its way to order and time operations in a random/adversarial manner.

> other work that modifies the compiler itself to replace the concurrency primitives with runtime functions that can then execute them in a fuzzed order

Loom[1] is somewhat like that. It's a testing system (not a runtime for a full app) which exercises multiple permutations of program execution (all possible permutations, I think, limited by test-case complexity or an optional "maximum thread switch count"), as well as modeling atomic/weak-memory effects to some degree.

[1]: https://github.com/tokio-rs/loom
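
For a sense of what that looks like, here is roughly the shape of a Loom test (a sketch in the style of its README; the test name and counter are just for the example). The closure passed to `loom::model` is re-executed under every interleaving Loom permits, using Loom's drop-in versions of the std concurrency types:

```rust
// Dev-dependency: loom = "0.7". Loom tests are typically gated behind
// `#[cfg(loom)]` and run with `RUSTFLAGS="--cfg loom" cargo test`.
use loom::sync::atomic::{AtomicUsize, Ordering};
use loom::sync::Arc;
use loom::thread;

#[test]
fn concurrent_increments() {
    // The closure is re-run once per interleaving of the two increments
    // (within Loom's configured bounds), and every run must pass.
    loom::model(|| {
        let counter = Arc::new(AtomicUsize::new(0));
        let other = counter.clone();

        let handle = thread::spawn(move || {
            other.fetch_add(1, Ordering::SeqCst);
        });

        counter.fetch_add(1, Ordering::SeqCst);
        handle.join().unwrap();

        assert_eq!(counter.load(Ordering::SeqCst), 2);
    });
}
```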