Processor "optimizations" can produce surprising effects. The problem is that these optimizations are not programmatically accessible from C (or most modern programming languages), given their simple memory models. Deterministic performance is not easy to obtain. My view is to not bother with such tricks unless absolutely necessary (and be prepared for your changes to actually pessimize performance on a future processor or a compatible processor from a different vendor).

If you are interested in this sort of thing, check out comp.arch!

I've been researching this pretty deeply for the last few years, and I've come to the conclusion that, without a complete redesign, most popular programming languages cannot have direct control of these optimizations in an ergonomic manner.

The reason I think this is that most languages target C or LLVM, and both C and LLVM have a fundamentally lossy compilation process.

To get around this, you'd need a hodgepodge of precompiler directives, or you'd have to take a completely different approach.
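To give a feel for what that hodgepodge looks like in practice, here is a rough sketch in C using two GCC/Clang builtins, `__builtin_expect` and `__builtin_prefetch`. The macro names (`LIKELY`, `UNLIKELY`, `PREFETCH`), the function `sum_nonnegative`, and the prefetch distance of 16 elements are all made up for illustration and not tuned for any processor. Both builtins are only hints: the compiler and the CPU are free to ignore them, and `__builtin_expect` mostly influences code layout rather than talking to the hardware branch predictor.

```c
#include <stddef.h>

/* Compiler-specific hints (GCC and Clang). On other compilers these
 * hypothetical macros fall back to no-ops, which is part of the problem:
 * the "optimization" silently disappears. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#define PREFETCH(p) ((void)0)
#endif

/* Assumed, untuned prefetch distance in elements. */
#define PREFETCH_AHEAD 16

/* Sums the non-negative elements of v[0..n-1]. The hints change nothing
 * about the result; they only suggest to the compiler that negative
 * values are rare and that later elements will be needed soon. */
long sum_nonnegative(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            PREFETCH(&v[i + PREFETCH_AHEAD]);
        }
        if (UNLIKELY(v[i] < 0)) {
            continue; /* hinted as the cold path */
        }
        total += v[i];
    }
    return total;
}
```

Nothing in the language tells you whether either hint helped, hurt, or was dropped; you only find out by measuring on the specific compiler and processor you care about.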

I found a cool project that uses a "Tower of IRs" that can re-establish source-to-binary provenance, which seems to me to be on the right track:

https://github.com/trailofbits/vast

I'd definitely like to see compilation processes become more transparent and easier to work with.