A clickbait title for an in-depth look at hand-optimizing a very simple loop.

I'm not a compiler expert, but if it's a "very simple loop," why is it still too complex for the compiler to turn into good machine code? Did they use a bad compiler on purpose? Or are computers just not fast enough yet for practical compilers to do a good job with very simple loops?

This is the right answer:

https://news.ycombinator.com/item?id=36622584

Optimal assembly (forgoing SIMD, at least) for this loop on modern x86 is highly dependent on the entropy of the runtime data.

OK, so they were gaming the benchmark? As in, the compiler's output would be faster on less contrived test data? Do I have to look up what FDO, PGO, and cmov are to understand the answer?

The compiler would generate different code if it knew the rates at which branches were taken.

If a branch is almost always taken or almost never taken, a compiler will want to emit a jump. The CPU's frontend will be able to predict the jump with high probability, and a successfully-predicted jump is "free." The cost of a misprediction is paid for by the near-zero cost of the many successful predictions.

If a branch is hard to predict (and taking versus not taking it would load a different value into a register/memory), the compiler wants to emit a conditional move (cmov). A conditional move is slightly "more expensive" in the backend because the CPU has to wait for the condition to resolve before it can execute instructions that depend on the result. However, it is much cheaper than a branch that mispredicts around half of the time.

FDO (feedback-directed optimization) or PGO (profile-guided optimization) means "run the code on some sample input and profile how often branches are taken/not taken." It gives the compiler more information to generate better code.

The problem with the blog post is that the compiler has no idea what the function's input data will look like. It (arbitrarily) chose to generate branches instead of cmovs. However, if the benchmark input is better suited for cmovs, then the benchmark will (wrongly) show that the compiler generates "slow" assembly. But that's not a fair test, because with PGO/FDO the compiler would generate equivalent assembly to the "fast" assembly (actually, probably faster). Finally, the human (OP) is using their knowledge of the benchmark data "unfairly" to write better assembly than the compiler.

The takeaway is: most of the time, one can't optimize code/assembly in a vacuum. You also need to know what the input data and access patterns look like. FDO/PGO gives the compiler more data to understand what the input data/access patterns look like.

Thank you, this is an amazingly comprehensive answer! Now I wonder what the workflow for using these compiler features would be. Say I'm a normal (or bad) C programmer: I write my program, use valgrind to check that it doesn't have obvious problems, and compile it with -march=native or whatever. Can I then add a step to the workflow that somehow re-compiles it in a way that uses the branching or access patterns from some example inputs I let it process for that purpose?

Yes, Clang supports FDO, but it might be hard to set up (I've never set up FDO myself). You could check out https://github.com/google/autofdo and https://clang.llvm.org/docs/UsersManual.html#profile-guided-....
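The instrumentation-based PGO workflow from the Clang docs looks roughly like this (a sketch; the file names `prog.c`, `prog.profdata`, and `sample_input.txt` are made up for illustration):

```shell
# 1. Build with profiling instrumentation.
clang -O2 -fprofile-instr-generate prog.c -o prog_instrumented

# 2. Run on representative input; this writes a raw profile
#    (default.profraw by default).
./prog_instrumented sample_input.txt

# 3. Merge the raw profile into the format the compiler consumes.
llvm-profdata merge -output=prog.profdata default.profraw

# 4. Rebuild, letting the profile guide branch layout, cmov
#    choices, inlining, and so on.
clang -O2 -fprofile-instr-use=prog.profdata prog.c -o prog_optimized
```

The key is step 2: the sample input should resemble real workloads, since that's exactly the branch-taken/not-taken data the compiler will optimize for.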

(People within Google say "FDO", basically everyone else says "PGO".)