I sometimes write code that needs to be portable across SSE4 and NEON, and I'm not sure this is going to work fast enough for that. Each ISA has important unique features.

SSE has shuffles, pack/unpack, movemask, 64-bit doubles, test-for-zero, float rounding, blends, integer averages, float square roots, and a dot product.

NEON has interleaved loads/stores, vector operations with a scalar as the second operand, byte swap, rotate, bit scan and population count, and versions of all instructions that process 8-byte (64-bit) vectors.

That's enough differences that I have to adjust both algorithms and data structures to be portable between them. I'm not convinced it's possible to do automatically.

Shuffle, pack/unpack, movemask, blends (SSE/AVX) and interleaved load/stores, byte swap (NEON) are "just" data-movement instructions.

All of them can be implemented (with obvious slowdowns) as a possibly conditional write to memory, then a possibly conditional read back in a different order. Yeah, it's inefficient to do it like this, but this "write then read" pattern really gives us an idea of what's going on between the registers in a pack/pshufb/whatever instruction.

On AMD GPUs, there's a fully arbitrary crossbar between SIMD lanes allowing for arbitrary movement. The two instructions are just "permute" and "bpermute" (backwards permute), roughly corresponding to scatter and gather respectively: permute pushes each lane's value to a chosen destination lane, while bpermute pulls a value from a chosen source lane.

On NVidia GPUs, perm and bperm are both expressible in PTX, but by reading/writing L1 or __shared__ memory. NVidia GPUs likely have a crossbar to L1 memory to make these operations very fast.

---------

The solution is to implement perm and bperm on AVX. It's already half-implemented: pshufb is equivalent to the GPU backwards permute — each destination byte pulls from a chosen source byte. CPUs are just missing the forward, scatter-style permute.

I'm pretty confident that pack/unpack, blends, interleaved load/stores, and more could all be implemented as pshufb and a hypothetical "backwards pshufb". Version 1.0 could be an NVidia-like "write to L1 cache" sort of implementation too, if full crossbars are too expensive at the hardware layer.

-----------

So the question is: how should we write code today? CPUs of today do not implement this feature, but CPUs of the future might. I think specifying the memory-moves explicitly, and then working on a "pshufb compiler/optimizer" of sorts is what we need.

> how should we write code today?

Write in a GPU-style "compute shader" language and let the compiler pick the fastest ways of doing things on each ISA?

https://github.com/ispc/ispc