What does HackerNews think of TSOEnabler?

Kernel extension that enables TSO for Apple silicon processes

Language: C

I think it’s slightly more complex than that. Apple included in their processor design custom extensions to ARM that alter things like memory order to make emulation easier [1]. It’s not just control over the software stack that makes Rosetta so performant.

[1] https://github.com/saagarjha/TSOEnabler

>I don't really need Rob to explain to me how Apple's processors do TSO ;)

Lemme just look up TSO and...

https://github.com/saagarjha/TSOEnabler

...Oh. Fair enough, my mistake :P

Is there not an instruction to switch into TSO mode, though? Wouldn't that technically count? :P

M1 implements TSO as a special mode specifically for x86 compatibility: https://github.com/saagarjha/TSOEnabler

If TSO didn’t have a performance penalty, it wouldn’t need to be a separate mode. Also, it should be obvious that stricter ordering constraints inherently allow less parallelism, so lower performance.
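
To make the ordering difference concrete, here is a minimal message-passing sketch using C11 atomics and POSIX threads. It is an illustration, not code from TSOEnabler, and the names (`payload`, `ready`, `message_passing_round`) are made up for this example. On a weakly ordered ARM core, the release store and acquire load below are what keep the flag from becoming visible before the payload. Under x86-style TSO, ordinary loads and stores already give that guarantee at the hardware level, which is why translated x86 code benefits from a TSO mode instead of barriers around every memory access.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;       /* plain, non-atomic data written by the producer */
static atomic_int ready;  /* flag published after the payload */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;
    /* Release store: the payload write is ordered before the flag write. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    int *seen = arg;
    /* Acquire load: once the flag is observed, the payload is visible too. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
        /* spin until the producer publishes */
    }
    *seen = payload;
    return NULL;
}

/* Runs one producer/consumer round and returns the value the consumer saw,
 * or -1 if a thread could not be created. */
static int message_passing_round(void) {
    pthread_t p, c;
    int seen = -1;

    payload = 0;
    atomic_store(&ready, 0);

    if (pthread_create(&c, NULL, consumer, &seen) != 0) {
        return -1;
    }
    if (pthread_create(&p, NULL, producer, NULL) != 0) {
        atomic_store(&ready, 1);  /* unblock the consumer before giving up */
        pthread_join(c, NULL);
        return -1;
    }
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

With the release/acquire pair in place, `message_passing_round()` returns 42 on any conforming implementation. Weakening both operations to `memory_order_relaxed` would turn the payload access into a data race, so the C standard no longer promises any result; on weakly ordered ARM hardware the stale value can actually be observed, whereas x86 hardware would not reorder the two stores (the compiler still could).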

It seems to be a CPU flag. This is managed by the kernel so that it is on for Rosetta 2 processes and off for everything else. See https://github.com/saagarjha/TSOEnabler

> Maybe the largest remaining difference is around the strength of the memory model

If it weren't for the following project... I'd agree with you.

https://github.com/saagarjha/TSOEnabler

> A kernel extension that enables total store ordering on Apple silicon, with semantics similar to x86_64's memory model. This is normally done by the kernel through modifications to a special register upon exit from the kernel for programs running under Rosetta 2; however, it is possible to enable this for arbitrary processes (on a per-thread basis, technically) as well by modifying the flag for this feature and letting the kernel enable it for us on. Setting this flag on certain processors can only be done on high-performance cores, so as a side effect of enabling TSO the kernel extension will also migrate your code off the efficiency cores permanently.
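
As a rough sketch of how a process might ask such a kernel extension to flip the flag for the calling thread, the snippet below wraps `sysctlbyname`. The sysctl name `kern.tso_enable` is an assumption here and should be checked against the repository before use. The helper names `tso_set_enabled` and `tso_get_enabled` are hypothetical. On non-Apple platforms the helpers simply fail with `ENOSYS`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* ASSUMPTION: the name of the sysctl exposed by the kernel extension. */
#define TSO_SYSCTL_NAME "kern.tso_enable"

/* Requests TSO on (1) or off (0) for the calling thread.
 * Returns 0 on success, -1 on failure with errno set. */
static int tso_set_enabled(int enabled) {
#if defined(__APPLE__)
    return sysctlbyname(TSO_SYSCTL_NAME, NULL, NULL, &enabled, sizeof(enabled));
#else
    (void)enabled;
    errno = ENOSYS;  /* no such facility outside macOS */
    return -1;
#endif
}

/* Reads the current setting. Returns the value reported by the sysctl,
 * or -1 if the sysctl is unavailable (for example, kext not loaded). */
static int tso_get_enabled(void) {
#if defined(__APPLE__)
    int enabled = 0;
    size_t len = sizeof(enabled);
    if (sysctlbyname(TSO_SYSCTL_NAME, &enabled, &len, NULL, 0) != 0) {
        return -1;
    }
    return enabled;
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

Because the README describes the setting as per-thread, a caller would presumably invoke `tso_set_enabled(1)` on each thread that needs x86-like ordering, and should expect that thread to stay on the high-performance cores afterwards.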

--------

It's clear that Apple has implemented total store ordering on its chips (including the M1).

It's switchable at runtime. Apple silicon can enable total store ordering on a per-thread basis while emulating x86_64, then turn it back off for maximum performance in native code.

Here's a kernel extension someone built to manipulate this feature: https://github.com/saagarjha/TSOEnabler

> There are many issues in determining what is code, where are the branch destinations in case of indirect branches, etc.

Yes, handling indirect branches seems a bit complex, and I'm not a specialist in the field. But I'm pretty sure that indirect branches are rare enough that an additional indirection is relatively inexpensive. Adding a simple address mapping table should cover most cases.
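
The "simple address mapping table" idea can be sketched as a small open-addressing hash table from guest (x86) addresses to translated host code pointers. This is a generic illustration, not a description of how Rosetta 2 works; the type and function names (`branch_map`, `branch_map_insert`, `branch_map_lookup`) and the table size are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 1024u  /* must be a power of two */

typedef struct {
    uint64_t guest;  /* address of the original (guest) instruction */
    void *host;      /* translated code for it; NULL marks an empty slot */
} map_entry;

/* A zero-initialized branch_map is an empty table. */
typedef struct {
    map_entry slots[MAP_SLOTS];
} branch_map;

/* Multiplicative hash: keeps the top 10 bits of the product. */
static size_t slot_for(uint64_t guest) {
    return (size_t)((guest * 0x9E3779B97F4A7C15ull) >> 54) & (MAP_SLOTS - 1);
}

/* Records guest -> host. Returns 0 on success, -1 if host is NULL or
 * the table is full. An existing entry for guest is overwritten. */
static int branch_map_insert(branch_map *m, uint64_t guest, void *host) {
    if (host == NULL) {
        return -1;
    }
    size_t i = slot_for(guest);
    for (size_t n = 0; n < MAP_SLOTS; n++, i = (i + 1) & (MAP_SLOTS - 1)) {
        if (m->slots[i].host == NULL || m->slots[i].guest == guest) {
            m->slots[i].guest = guest;
            m->slots[i].host = host;
            return 0;
        }
    }
    return -1;
}

/* Returns the translated target for guest, or NULL on a miss. A real
 * translator would fall back to translating or interpreting on a miss. */
static void *branch_map_lookup(const branch_map *m, uint64_t guest) {
    size_t i = slot_for(guest);
    for (size_t n = 0; n < MAP_SLOTS; n++, i = (i + 1) & (MAP_SLOTS - 1)) {
        if (m->slots[i].host == NULL) {
            return NULL;
        }
        if (m->slots[i].guest == guest) {
            return m->slots[i].host;
        }
    }
    return NULL;
}
```

At each translated indirect branch, the emitted code would call something like `branch_map_lookup` with the runtime guest target and jump to the result. The extra cost is one hash and, usually, one probe per indirect branch, which is the "additional indirection" mentioned above.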

An interesting question would also be whether Apple has added features to the hardware to improve the translation?

We know, for example, that Apple introduced a special register [1] to temporarily switch from the ARM memory consistency model to the TSO (Total Store Order) consistency model used by x86.

[1] : https://github.com/saagarjha/TSOEnabler