This is a great post showing why you have to measure the specific tasks you care about rather than relying on general assumptions. Another example I remember seeing was crypto/hashing performance: you could find embedded processors competing with much faster general-purpose chips because they had dedicated instructions for those use-cases, and performance would fall off a cliff if you used different encryption or hashing settings or an unoptimized libssl.
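To make that concrete: a quick sketch of "measure the exact primitive you care about", using Python's hashlib. On hardware with dedicated SHA-256 instructions (and an OpenSSL build that uses them), sha256 can beat algorithms that look faster on paper, and the only way to know is to time your specific algorithm on your specific chip. The algorithm names and sizes below are just illustrative choices, not anything from the article:

```python
import hashlib
import time

def throughput_mb_s(name, data, rounds=50):
    """Rough MB/s for one hash algorithm over `data`, repeated `rounds` times."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name, data).digest()
    elapsed = time.perf_counter() - start
    return len(data) * rounds / elapsed / 1e6

# 1 MiB of dummy input; results vary wildly across CPUs, which is the point.
data = b"x" * (1 << 20)
for algo in ("sha256", "sha512", "blake2b"):
    print(f"{algo:8s} {throughput_mb_s(algo, data):8.0f} MB/s")
```

The relative ordering of those three lines is exactly what changes between a chip with SHA extensions and one without.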

I’d be curious how the unified memory architecture shifts the cost dynamic for GPU acceleration. There’s a fair amount of SIMD work where the cost of copying to/from the GPU outweighs the savings until you get over a particular amount of data, and that threshold should be different on systems like the M1.
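That threshold falls out of a simple back-of-the-envelope cost model: the GPU pays a fixed launch overhead plus a per-byte copy cost, and wins only once the per-byte compute saving has amortized the overhead. A minimal sketch, with all the numbers (launch latency, bandwidth, FLOP rates, arithmetic intensity) being made-up illustrative figures rather than measured ones:

```python
import math

def breakeven_bytes(launch_overhead_s, copy_bw_bytes_per_s,
                    cpu_flops, gpu_flops, flops_per_byte):
    """Smallest input size (bytes) at which GPU offload beats the CPU.

    Cost model:
      CPU time(n) = n * flops_per_byte / cpu_flops
      GPU time(n) = overhead + 2n / copy_bw + n * flops_per_byte / gpu_flops
    (the factor of 2 covers copying the input in and the result back).
    """
    per_byte_saving = (flops_per_byte / cpu_flops
                       - flops_per_byte / gpu_flops
                       - 2.0 / copy_bw_bytes_per_s)
    if per_byte_saving <= 0:
        return math.inf  # copies eat all the savings: the GPU never wins
    return launch_overhead_s / per_byte_saving

# Illustrative: 10 us launch overhead, 16 GB/s copy path,
# 100 GFLOP/s CPU, 1 TFLOP/s GPU, 16 flops of work per byte.
discrete = breakeven_bytes(10e-6, 16e9, 100e9, 1e12, 16)
# Unified memory modeled crudely as a free copy (infinite bandwidth).
unified = breakeven_bytes(10e-6, math.inf, 100e9, 1e12, 16)
print(f"discrete: ~{discrete/1e3:.0f} KB, unified: ~{unified/1e3:.0f} KB")
```

With the copy term removed, the break-even size drops by several-fold in this toy model, which is why you'd expect the M1's threshold to sit lower than a discrete-GPU system's.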

It's a poor post, much like the last one, if for no other reason than that it's done so sloppily. There's nothing wrong with running simple, informal benchmarks, but at a minimum, showing one's build and run details would make the limitations and outright mistakes more obvious.

The article shared the full configuration – perhaps you missed it?

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...