I appreciate learning more about what, exactly, constitutes ARM's "weaker memory model". It's clearer to me after reading this article.
I wonder how much of the performance gain of e.g. Apple's M1 chip over an x86 CPU can be attributed to this weaker constraint. Given that the M1 can outperform an x86 CPU even when emulating x86 code, perhaps not much.
Also, I suspect programming languages that are immutable by default will gain a larger advantage from ARM's weaker memory model, since the compiler can more often safely let the CPU reorder memory accesses: it doesn't have to make the CPU wait for a mutable variable to be updated before executing a subsequent line of code that depends on that updated value.
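To make that concrete, here's a small C++ sketch of my own (not from the article): a mutable flag shared between threads needs acquire/release ordering, which on ARM compiles to ldar/stlr or dmb barriers, whereas data that's never modified after construction can be read with plain loads the CPU is free to reorder.

    #include <atomic>
    #include <thread>

    struct Config { int a = 1; int b = 2; };
    const Config config{};            // immutable after construction: plain loads, no barriers

    std::atomic<bool> ready{false};   // mutable shared state: needs ordering

    void producer() {
        // On AArch64 this release store becomes stlr (or dmb + str).
        ready.store(true, std::memory_order_release);
    }

    void consumer() {
        // On AArch64 this acquire load becomes ldar (or ldr + dmb).
        while (!ready.load(std::memory_order_acquire)) {}
        // Reads of the immutable config need no barriers at all.
        int sum = config.a + config.b;
        (void)sum;
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }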
The M1 outperforms an x86 CPU when emulating x86 code by translating that code into ARMv8 instructions. So if the x86 code was written assuming x86's strong memory model and is then dynamically translated into ARMv8 and run under a weak memory model, there will be problems, no? Maybe Rosetta 2 handles this; that would be impressive.
No, Apple doesn't run the translated code under a weak memory model. Rosetta 2 toggles on total store ordering (TSO), which the M1 supports in hardware, when running translated code.
Microsoft's translator (and QEMU?) does insert extra barriers so the translated code runs correctly under a weak memory model, because they're designed to run on hardware without TSO.
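Roughly what that barrier insertion amounts to, sketched in C++ atomics (my illustration, not Microsoft's or QEMU's actual output): under x86-TSO every ordinary load already behaves like an acquire and every ordinary store like a release, so a conservative translation for a weakly ordered target has to upgrade each translated memory access accordingly.

    #include <atomic>

    std::atomic<int> shared_word{0};

    // x86: mov eax, [shared_word]  -- under TSO this plain load already has acquire semantics.
    // ARM: the translator must emit ldar (or ldr + dmb) to preserve that ordering.
    int translated_load() {
        return shared_word.load(std::memory_order_acquire);
    }

    // x86: mov [shared_word], v    -- under TSO this plain store already has release semantics.
    // ARM: the translator must emit stlr (or dmb + str) to preserve that ordering.
    void translated_store(int v) {
        shared_word.store(v, std::memory_order_release);
    }

    int main() {
        translated_store(42);
        return translated_load() == 42 ? 0 : 1;
    }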
How does the toggle work? Is it per thread, CPU affinity, per process, CPU flag or something totally different?
It seems to be a CPU flag. This is managed by the kernel so that it is on for Rosetta 2 processes and off for everything else. See https://github.com/saagarjha/TSOEnabler
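If it helps, here's a very rough, hypothetical sketch of how a per-core register can be made to look like a per-process setting (the names and helpers are mine, not XNU's or TSOEnabler's): the kernel keeps the desired value in the thread state and reprograms the bit on every context switch.

    #include <cstdio>

    // Hypothetical per-thread state; not the actual XNU thread structure.
    struct Thread {
        bool wants_tso;   // true for threads belonging to a Rosetta 2 process
    };

    // Stand-in for the write to the Apple implementation-defined TSO enable bit;
    // the real register and bit number are not modeled here.
    void write_tso_enable(bool on) {
        std::printf("TSO %s\n", on ? "on" : "off");
    }

    // On every context switch the kernel programs the per-core bit to match the
    // incoming thread, so the flag effectively follows the process being run.
    void context_switch_to(const Thread& next) {
        // ... restore registers, address space, etc. ...
        write_tso_enable(next.wants_tso);
    }

    int main() {
        Thread rosetta{true}, native{false};
        context_switch_to(rosetta);
        context_switch_to(native);
    }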