What does HackerNews think of TSOEnabler?

Nvidia to Challenge Intel with Arm-Based Processors for PCs | Oct 2023

I think it’s slightly more complex than that. Apple's processor design includes custom extensions to ARM that change things like memory ordering to make emulation easier [1]. It’s not just control over the software stack that makes Rosetta so performant.

[1] https://github.com/saagarjha/TSOEnabler

Apple M2 Pro to use new 3nm process | Aug 2022

>I don't really need Rob to explain to me how Apple's processors do TSO ;)

Lemme just look up TSO and...

https://github.com/saagarjha/TSOEnabler

...Oh. Fair enough, my mistake :P

Is there not an instruction to switch into TSO mode, though? Wouldn't that technically count? :P

Ask HN: Can competitors catch up to Apple Silicon? | Jul 2022

M1 implements TSO as a special mode specifically for x86 compatibility: https://github.com/saagarjha/TSOEnabler

If TSO didn’t have a performance penalty, it wouldn’t need to be a separate mode. Also, it should be obvious that stricter ordering constraints inherently allow less parallelism, so lower performance.
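To make the ordering difference concrete, here is a minimal message-passing sketch in C++ (not from the thread; the function name `count_stale_reads` is hypothetical). Under x86's TSO model, hardware already keeps the two stores in order and the two loads in order. On a weakly ordered ARM core without a TSO mode, a translator would have to request that ordering explicitly, roughly as the release/acquire pair does here, on essentially every translated memory access.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message-passing litmus test. Returns how many iterations saw the flag set
// but read stale data. With the release/acquire pair below this is always 0.
int count_stale_reads(int iterations) {
    assert(iterations > 0);
    int stale = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> data{0};
        std::atomic<bool> ready{false};
        int observed = -1;

        std::thread producer([&] {
            data.store(42, std::memory_order_relaxed);
            // Release: the data store may not be reordered after this store.
            ready.store(true, std::memory_order_release);
        });
        std::thread consumer([&] {
            // Acquire: the data load may not be reordered before this load.
            while (!ready.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            observed = data.load(std::memory_order_relaxed);
        });

        producer.join();
        consumer.join();
        if (observed != 42) {
            ++stale;
        }
    }
    return stale;
}
```

Weakening both the flag store and the flag load to `std::memory_order_relaxed` would remove the guarantee in the C++ model, which is the kind of freedom a weakly ordered core can exploit in native code.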

ARM and Lock-Free Programming | Dec 2020

It seems to be a CPU flag. This is managed by the kernel so that it is on for Rosetta 2 processes and off for everything else. See https://github.com/saagarjha/TSOEnabler

What do RISC and CISC mean in 2020? | Nov 2020

> Maybe the largest remaining difference is around the strength of the memory model

If it weren't for the following project... I'd agree with you.

https://github.com/saagarjha/TSOEnabler

> A kernel extension that enables total store ordering on Apple silicon, with semantics similar to x86_64's memory model. This is normally done by the kernel through modifications to a special register upon exit from the kernel for programs running under Rosetta 2; however, it is possible to enable this for arbitrary processes (on a per-thread basis, technically) as well by modifying the flag for this feature and letting the kernel enable it for us on. Setting this flag on certain processors can only be done on high-performance cores, so as a side effect of enabling TSO the kernel extension will also migrate your code off the efficiency cores permanently.

--------

It's clear that Apple has implemented total store ordering on its chips (including the M1).

World of Warcraft 9.0.2 client runs natively on Apple Silicon | Nov 2020

Exhibit A: https://www.realworldtech.com/forum/?threadid=193883&curpost...

Exhibit B: https://github.com/saagarjha/TSOEnabler

16-inch MBP 2x slower than M1 MacBook Air in a real-world Rust compile | Nov 2020

It's switchable at runtime. Apple silicon can enable total store ordering on a per-thread basis while emulating x86_64, then turn it back off for maximum performance in native code.

Here's a kernel extension someone built to manipulate this feature: https://github.com/saagarjha/TSOEnabler

Apple Silicon M1 Emulating x86 Is Still Faster Than Every Other Mac | Nov 2020

> There are many issues in determining what is code, where are the branch destinations in case of indirect branches, etc.

Yes, handling indirect branches seems a bit complex, and I'm not a specialist in the field. But I'm pretty sure indirect branches are rare enough that an additional indirection is relatively inexpensive. A simple address mapping table should cover most cases.

Another interesting question is whether Apple has added hardware features to improve the translation.

We know, for example, that Apple introduced a special register [1] to temporarily switch from the ARM consistency model to the TSO (total store order) consistency model of x86.

[1] https://github.com/saagarjha/TSOEnabler
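The address mapping table suggested above for indirect branches could look roughly like the following C++ sketch. It is an illustration only, not how Rosetta 2 is known to work; `BranchTargetMap`, `record`, and `lookup` are hypothetical names, and treating host address 0 as invalid is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Maps guest (x86) branch targets to the host (ARM) addresses of their
// translated code. An indirect branch pays for one extra lookup at run time.
class BranchTargetMap {
public:
    // Remember where the translation of guest_addr lives.
    void record(std::uint64_t guest_addr, std::uint64_t host_addr) {
        assert(host_addr != 0);  // assumption: 0 is never a valid host address
        table_[guest_addr] = host_addr;
    }

    // Returns true and fills host_addr if guest_addr has been translated.
    // A miss means the caller has to translate the target first.
    bool lookup(std::uint64_t guest_addr, std::uint64_t& host_addr) const {
        auto it = table_.find(guest_addr);
        if (it == table_.end()) {
            return false;
        }
        host_addr = it->second;
        return true;
    }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> table_;
};
```

A translated indirect jump would call `lookup` with the guest target computed at run time, then jump to the returned host address, falling back to the translator on a miss.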

Apple Silicon implements tricky x86 behaviors in hardware for faster emulation | Jul 2020

https://github.com/saagarjha/TSOEnabler