> Could we do better? Assuredly. There are many AVX-512 instructions that we are not using yet. We do not use ternary Boolean operations (vpternlog). We are not using the new powerful shuffle functions (e.g., vpermt2b). We have an example of coevolution: better hardware requires new software which, in turn, makes the hardware shine.
> Of course, to get these new benefits, you need recent Intel processors with adequate AVX-512 support.
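For readers who haven't met the instructions the quote names, here is a minimal, illustrative sketch of what vpternlog buys you: its 8-bit immediate is an arbitrary three-input truth table, so any ternary Boolean function (here a three-way XOR, truth table 0x96) becomes a single instruction. The function name is made up; the intrinsic itself is part of AVX-512F.

```c
#include <immintrin.h>

// Illustrative only: vpternlogd via _mm512_ternarylogic_epi32.
// The immediate 0x96 is the truth table for a ^ b ^ c, so three XORs
// collapse into one instruction. Compile with -mavx512f; needs an
// AVX-512F-capable CPU to run.
static inline __m512i xor3(__m512i a, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi32(a, b, c, 0x96);
}
```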
AVX-512 support can be confusing because it’s often talked about as if it were a single instruction set.
AVX-512 is actually a large family of instruction subsets whose availability varies from CPU to CPU. It’s not enough to say that a CPU “has AVX-512”; it’s not a binary question. You have to know which AVX-512 subsets a particular CPU supports.
Wikipedia has a partial chart of AVX-512 support by CPU: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
Note that some instructions that are available in one generation of CPUs can actually be unavailable in the next (usually because they were superseded). If you go deep enough into AVX-512 optimization, you essentially end up targeting a specific CPU with your code. That is not a big deal if you’re deploying software to 10,000 carefully controlled cloud servers with known specifications, but it makes general use, and especially consumer use, much harder.
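As a concrete illustration of the “which subsets?” problem, here is a minimal sketch using the GCC/Clang `__builtin_cpu_supports` builtin; the particular subsets queried are just examples, and each one is a separate feature bit that a given CPU may or may not report.

```c
#include <stdio.h>

// Sketch (GCC/Clang, x86-64): each AVX-512 subset has its own CPUID
// feature bit, so a CPU can report some of these and not others.
int main(void) {
    __builtin_cpu_init();  // ensure the compiler's CPU feature cache is populated
    printf("avx512f    (foundation)             : %d\n", __builtin_cpu_supports("avx512f"));
    printf("avx512bw   (byte/word ops)          : %d\n", __builtin_cpu_supports("avx512bw"));
    printf("avx512vl   (128/256-bit encodings)  : %d\n", __builtin_cpu_supports("avx512vl"));
    printf("avx512vbmi (byte shuffles, vpermt2b): %d\n", __builtin_cpu_supports("avx512vbmi"));
    return 0;
}
```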
Are there good libraries for doing runtime feature detection? E.g., include three versions of a hot function X in the binary and have it seamlessly insert the correct function pointer at startup? Or have the function contain multiple bodies and just JMP to the correct block of code?
I know you can do this yourself, but last time I looked it was a heavily manual process: you basically had to define a plugin interface and dynamically load your selected implementation from a separate shared object. What are the barriers to compilers being hinted into transparently generating multiple versions of key functions?
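For what it’s worth, GCC and Clang can already do something close to this via function multi-versioning: the `target_clones` attribute makes the compiler emit one variant of the function per listed target plus an ifunc resolver, so the dynamic loader picks the right version once at startup and callers just call the function normally. A minimal sketch follows (the function and the chosen targets are illustrative; this relies on ifunc support, e.g. glibc on Linux):

```c
#include <stddef.h>
#include <stdint.h>

// Sketch: the compiler emits an AVX-512F clone, an AVX2 clone, and a
// baseline clone of this function, plus a resolver that selects one at
// load time based on the running CPU.
__attribute__((target_clones("avx512f", "avx2", "default")))
uint64_t sum_u32(const uint32_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];  // each clone is auto-vectorized for its target
    return total;
}
```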