Did anyone actually look at the machine code generated here? 0.30 ns per value? That is basically one cycle. There is no way a processor can execute that many dependent instructions in one cycle: each instruction decodes into at least one micro-op, and every micro-op needs at least one cycle in an execution unit. So either the compiler is unrolling the benchmarking loop, or the processor is speculating many iterations ahead so that the latencies overlap and the average works out to one cycle. One cycle per value for any kind of loop is flat-out suspicious.
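
To make the latency-versus-throughput point concrete, here is a minimal sketch (my own, not Daniel's benchmark; the xorshift64 step is just a stand-in for whatever g() actually does) that times the same work two ways: once with independent iterations the CPU can overlap, and once with each call fed from the previous result so nothing can overlap:

  #include <chrono>
  #include <cstddef>
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // Stand-in generator (xorshift64); not the generator from the blog post.
  static inline uint64_t step(uint64_t x) {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return x;
  }

  int main() {
    const size_t N = 20000;
    std::vector<uint64_t> out(N);

    // Independent iterations: the CPU can keep many in flight at once,
    // so the per-value time can drop toward one cycle.
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; i++) out[i] = step(i + 1);
    auto t1 = std::chrono::steady_clock::now();

    // Dependent chain: each call waits for the previous result, so the
    // per-value time reflects the full latency of the operation.
    uint64_t x = 1;
    for (size_t i = 0; i < N; i++) x = step(x);
    auto t2 = std::chrono::steady_clock::now();

    double ind = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
    double dep = std::chrono::duration<double, std::nano>(t2 - t1).count() / N;
    printf("independent: %.2f ns/value\n", ind);
    printf("dependent:   %.2f ns/value (checksum %llu)\n", dep,
           (unsigned long long)x);
    return (int)(out[N - 1] & 1);  // keep the first loop from being removed
  }

Compile with optimizations; the gap between the two numbers is exactly the overlap that a ~0.3 ns/value figure quietly relies on.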

This requires a lot more digging to understand.

Simply put, I don't accept the hastily reached conclusion, and I wish Daniel would dig deeper next time. This experiment is a poor example of how to investigate performance on small kernels: at this point you should be reading the assembly the compiler emits, not spitballing.
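
For what it's worth, assuming the kernel lives in a single file (bench.cpp is my placeholder name, not the repository's), dumping the compiler's assembly is a one-liner with gcc or clang:

  c++ -O3 -S -o - bench.cpp | less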

This is the benchmarking loop:

  for (size_t i = 0; i < N; i++) {
    out[i++] = g();
  }
N is 20000 and the measured time is divided by N. [1] However, the loop increments i twice per iteration, so it only computes 10000 numbers.

This is also visible in the assembly:

  add     x8, x8, #2
So if I see this correctly, the results are off by a factor of 2.
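
For comparison, a version of the loop that actually fills all N slots (and would roughly double the reported per-value time) would look like:

  for (size_t i = 0; i < N; i++) {
    out[i] = g();
  }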

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...