The dependency chain is state += 0x60bee2bee120fc15ull or (state += UINT64_C(0x9E3779B97F4A7C15)); the rest of the calculations are independent per iteration.

Anyway, the more important fact is that 64x64b -> 128b mul might be one instruction on x86, but it's broken into 2 µops. Because modern CPUs generally don't design around µops being able to write two registers in the same set.

It's a shame we can't see the rest of the code. What is happening to the result value? Is it being compared to something? Put into an array, or what? All of that code probably totally outweighs what you pointed out here. Or, at least it should. I have a bad feeling it might be being dead-code eliminated, since compilers are super aggressive about that nowadays, but I hope he's somehow controlled for that.

brigade

The blog post links to the benchmark... It's repeatedly populating a 20k entry array with the results.

godbolt clang compiles it to:

    .LBB5_2:                                // =>This Inner Loop Header: Depth=1
        mul     x13, x11, x10
        umulh   x14, x11, x10
        eor     x13, x14, x13
        mul     x14, x13, x12
        umulh   x13, x13, x12
        eor     x13, x13, x14
        str     x13, [x0, x8, lsl #3]
        add     x8, x8, #2                      // =2
        cmp     x8, x1
        add     x11, x11, x9
        b.lo    .LBB5_2

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...