Is anyone else thinking, what the f*ck? Are we in a new era of computing? It certainly feels that way looking at these desktop-class ARM chips, where performance doubled every year or so, just like back in the 80s and 90s.
ARM CPU performance has been improving consistently year over year for the past decade. People have been posting benchmarks of the A11/A12/A13 versus Intel for a while, so this has been pretty obvious. It's only surfacing now because we suddenly have a desktop CPU running desktop software like compilers, where the difference is obvious outside of benchmark tools.

Apple is just moving its product line onto its existing ARM track, which has already surpassed Intel. Once they've migrated all their lines to ARM, the performance gains will be more like they have been on the iPhone/iPad over the past few years: mostly 20-30% per year.
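
For a rough sense of how 20-30% per year compares with doubling every year, here is the doubling time under the assumption that gains compound at a constant annual rate r (a simplification for illustration, not a measured figure):

```latex
(1+r)^n = 2 \;\Rightarrow\; n = \frac{\ln 2}{\ln(1+r)}, \qquad
n \approx 3.8 \text{ years at } r = 0.20, \quad
n \approx 2.6 \text{ years at } r = 0.30
```

So under that assumption, performance would double roughly every 3 years rather than every year.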

IMO, this has little to do with it being ARM. 30 years ago ARM had a significant microarchitectural advantage in performance per watt, but in this era of 10-billion-transistor chips, that advantage has disappeared. x86_64 rationalized the x86 architecture, and decode is such a small fraction of the power budget that it really doesn't matter anyway.

What does matter, IMO:

- assembling a killer team

- 5nm process

- high speed, low latency DRAM

- big-little

I'm no expert, but the only big architectural differences are a massively larger decoder and a reorder buffer that's several times as large as x86 designs.

If these are actually the reasons for the performance difference, and it's difficult to do these on x86 because of the instruction set, it seems to this amateur that ARM64 really does have an advantage over x86.

Don't forget ARM's more relaxed memory model vs. x86's TSO.
One of the reasons Rosetta 2 works so well is that Apple silicon sticks to the more restrictive x86 memory model.
Does it? Apple's documentation seems to disagree [1]:

"A weak memory ordering model, like the one in Apple silicon, gives the processor more flexibility to reorder memory instructions and improve performance, but doesn’t add implicit memory barriers."

[1] https://developer.apple.com/documentation/apple_silicon/addr...

It's switchable at runtime. Apple silicon can enable total store ordering on a per-thread basis while emulating x86_64, then turn it back off for maximum performance in native code.

Here's a kernel extension someone built to manipulate this feature: https://github.com/saagarjha/TSOEnabler
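
To make the weak-ordering vs. TSO point concrete, here's a minimal C++ sketch of the classic message-passing pattern. The names (`payload`, `ready`, `run_once`) are made up for illustration and have nothing to do with Rosetta 2's internals. Under x86's TSO, the hardware doesn't reorder a store with an earlier store or a load with an earlier load, so compiled x86 code gets this ordering without explicit barriers. On a weakly ordered ARM core the release/acquire pair below is what guarantees it.

```cpp
#include <atomic>
#include <thread>

// Hypothetical shared state for the litmus test.
std::atomic<int> payload{0};
std::atomic<bool> ready{false};

// Runs one writer and one reader; returns the payload value the reader saw.
int run_once() {
    payload.store(0, std::memory_order_relaxed);
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread writer([] {
        payload.store(42, std::memory_order_relaxed);
        // Release: the payload store above must be visible before the flag.
        ready.store(true, std::memory_order_release);
    });

    std::thread reader([&seen] {
        // Acquire: once the flag is seen, the payload store is visible too.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload.load(std::memory_order_relaxed);
    });

    writer.join();
    reader.join();
    return seen;
}
```

With the release/acquire pair, `run_once()` returns 42 on any conforming implementation. If both flag accesses were relaxed instead, a weakly ordered core would be allowed to let the reader see the flag before the payload. Translated x86 binaries assume the stronger ordering everywhere, so an emulator either has to insert barriers around memory accesses or run the thread with TSO switched on in hardware.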