I think if x86 got rid of its legacy instructions, the cores could get smaller, with better performance per watt and maybe better raw performance as side effects. If you need the legacy instructions you could just emulate them; most consumer PCs don't need them.

That's the biggest difference between x86 and ARM: ARM has made a lot of breaking changes across its versions, while x86 hasn't (I'm not sure there has been any breaking change in the last 20 years at least).

X86 will always be slower because it has stricter memory ordering semantics.
Those are the same ordering semantics (TSO, total store ordering) that Apple's M1 implements.

I guess the M1 should be slower too, then?

M1 implements TSO as a special mode specifically for x86 compatibility: https://github.com/saagarjha/TSOEnabler

If TSO didn’t have a performance penalty, it wouldn’t need to be a separate mode. Also, it should be obvious that stricter ordering constraints leave the hardware less freedom to reorder and overlap memory operations, which means less parallelism and therefore lower performance.
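To make the "stricter ordering allows fewer reorderings" point concrete, here is a toy sketch, not a real hardware model. It assumes a memory model can be approximated as "which pairs of accesses within one thread may swap order", and it enumerates the results of two classic litmus tests under that assumption. The names `MP`, `SB`, `may_swap`, `thread_orders`, `interleavings`, and `outcomes` are all made up for this illustration, and real x86 and ARM models have details (store forwarding, barriers, dependencies) that this ignores.

```python
from itertools import permutations

# Each operation is (kind, address, arg): "st" stores the constant `arg`,
# "ld" loads into the register named `arg`. Memory starts out as all zeros.
MP = [  # message-passing litmus test
    [("st", "data", 1), ("st", "flag", 1)],         # producer
    [("ld", "flag", "r1"), ("ld", "data", "r2")],   # consumer
]
SB = [  # store-buffering litmus test
    [("st", "x", 1), ("ld", "y", "r1")],
    [("st", "y", 1), ("ld", "x", "r2")],
]


def may_swap(model, first, second):
    """May `second` take effect before `first`, given program order first, second?"""
    if first[1] == second[1]:
        return False  # same address: keep program order
    if model == "sc":
        return False  # sequential consistency: no reordering at all
    if model == "tso":
        # Simplified TSO: only a later load may pass an earlier store.
        return first[0] == "st" and second[0] == "ld"
    return True  # "weak": any accesses to different addresses may swap


def thread_orders(model, ops):
    """All orders of one thread's operations that the model permits."""
    orders = []
    for perm in permutations(range(len(ops))):
        legal = all(
            may_swap(model, ops[perm[j]], ops[perm[i]])
            for i in range(len(perm))
            for j in range(i + 1, len(perm))
            if perm[i] > perm[j]  # this pair runs out of program order
        )
        if legal:
            orders.append([ops[k] for k in perm])
    return orders


def interleavings(a, b):
    """All ways to merge two sequences while keeping each one's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(model, test):
    """Set of (r1, r2) results reachable for a two-thread test under `model`."""
    seen = set()
    for order0 in thread_orders(model, test[0]):
        for order1 in thread_orders(model, test[1]):
            for trace in interleavings(order0, order1):
                mem, regs = {}, {}
                for kind, addr, arg in trace:
                    if kind == "st":
                        mem[addr] = arg
                    else:
                        regs[arg] = mem.get(addr, 0)
                seen.add((regs["r1"], regs["r2"]))
    return seen


if __name__ == "__main__":
    for name, test in (("MP", MP), ("SB", SB)):
        for model in ("sc", "tso", "weak"):
            print(name, model, sorted(outcomes(model, test)))
```

In this toy model the consumer in `MP` can see the flag set but stale data (`r1 == 1`, `r2 == 0`) only under `"weak"`, never under `"tso"`. That is the guarantee x86 code relies on without barriers, and the freedom a weakly ordered core gives up when it runs in TSO mode. `SB` shows that TSO is itself already weaker than sequential consistency: both loads returning 0 is possible under `"tso"` but not under `"sc"`.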