The article is very light on details: it contains zero citations and only a single result from a single benchmark the author ran, with no details of how it was run. It then states his theory as to why this happens as fact (again with no citations). The author does not even offer a clue as to which ARM core was used. The claim is:

  > The difference is that the computation of the most significant bits of a 64-bit product on an ARM processor requires a separate and expensive instruction.
I see no support for this anywhere in the ARMv8 spec. You get the lower 64 bits of the result using MUL and the upper 64 bits using UMULH. Neither of those is that expensive.
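For anyone who wants to see the two halves in C, here is a minimal sketch. It assumes a compiler with the `unsigned __int128` extension (GCC or Clang on a 64-bit target); the function names are mine, not from the article. On AArch64 these compilers typically turn the first function into a MUL and the second into a UMULH, but check the generated assembly rather than taking my word for it.

```c
#include <stdint.h>

/* Low 64 bits of a 64x64-bit product: an ordinary wrapping multiply. */
static inline uint64_t mul_lo64(uint64_t a, uint64_t b) {
    return a * b;
}

/* High 64 bits of the full 128-bit product.
   Relies on the GCC/Clang unsigned __int128 extension (an assumption
   about the toolchain, not something the article specifies). */
static inline uint64_t mul_hi64(uint64_t a, uint64_t b) {
    unsigned __int128 product = (unsigned __int128)a * b;
    return (uint64_t)(product >> 64);
}
```

If you need both halves of the same product, that is two instructions on ARMv8, versus a single widening multiply on x86-64.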

Looking at [1], we can see that MUL has a throughput of 1 per cycle and a latency of 3 cycles, while UMULH has a throughput of 1/4 and a latency of 6. But as long as you do not issue another multiply right after your UMULH, that 1/4 throughput is easily hidden: only the multiplier is busy, and the rest of the CPU can carry on. So unless your entire loop is under 6 cycles, or you simply have nothing to schedule in the 3 cycles after the UMULH that does not need the multiplier, it shouldn't matter. Given that each of those large constants needs 4 instructions to load (mov+movk+movk+movk), there are plenty of instructions to schedule after the UMULH. Either OP's compiler messed up, or something entirely different is going on.
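To make the "4 instructions per constant" point concrete, here is a rough sketch that counts how many mov/movk instructions a 64-bit immediate needs under a simplified model: one instruction per nonzero 16-bit chunk. The function name is hypothetical, and the model deliberately ignores MOVN and logical-immediate encodings that a real assembler or compiler may use to do better.

```c
#include <stdint.h>

/* Simplified estimate of the mov/movk sequence length for a 64-bit
   immediate on AArch64: one instruction per nonzero 16-bit chunk,
   and at least one instruction even for zero.
   Assumption: no MOVN or bitmask-immediate shortcuts are used. */
static int movz_movk_count(uint64_t imm) {
    int count = 0;
    for (int shift = 0; shift < 64; shift += 16) {
        if ((imm >> shift) & 0xFFFFu) {
            count++;
        }
    }
    return count > 0 ? count : 1;
}
```

For 0x1b03738712fad5c9 all four chunks (0x1b03, 0x7387, 0x12fa, 0xd5c9) are nonzero, so this model gives 4, matching the mov+movk+movk+movk sequence.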

If the author was using a weaker in-order core, say a Cortex-A55, I would still expect better performance than what was shown. There [2], the low part is computed in 2 or 3 cycles and the high part in 4. But comparing an ARM in-order little core to a modern OoO x86 is just not fair.

EDIT: Indeed, what gcc produces for this code [3] is sad. For example, why it bothers synthesizing 0x1b03738712fad5c9 before issuing the first UMULH is unclear, but it IS stupid.

EDIT2: On Skylake [4], MUL has a latency of 3, so it is faster than on ARM, but not by that much. I'd guess that loading each constant with 4 instructions on ARM hurts more than UMULH does.

EDIT3: In the comments on the original site, the author said the ARM chip being used is a "Skylark" by "Ampere Computing" [5]. Given that I cannot find any info on that microarchitecture, I cannot say more about why it might be slow.

[1] Cortex®-A72 Software Optimization Guide: https://static.docs.arm.com/uan0016/a/cortex_a72_software_op...

[2] Cortex®-A55 Software Optimization Guide: https://static.docs.arm.com/epm128372/20/arm_cortex_a55_soft...

[3] Godbolt for this code: https://godbolt.org/z/UeOo6C

[4] Lists of instruction latencies, throughputs and micro operation breakdowns for Intel, AMD and VIA CPUs: https://www.agner.org/optimize/instruction_tables.pdf

[5] Skylark - Microarchitectures - AppliedMicro: https://en.wikichip.org/wiki/apm/microarchitectures/skylark

I think Daniel's use of the word "separate" in "separate and expensive" is ill-advised, as it implies a critique of ARM's ISA design in a way that isn't relevant for this case. One might be concerned if you needed all 128 bits in some other use, but not here.

As for loading large constants, if you read the post and follow the link at "reuse my benchmark" (https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...) you will see that these functions as measured are inside hot loops. As such, constant loading is very likely to be hoisted out of these loops on both architectures.

This will make the considerably slower UMULH stick out like a sore thumb. Also note that the measurement loop allows most of the work of each iteration to be done in parallel: the work of the RNG is a long dependency chain within the calculation, but the update of the seed is quick and independent of that.
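Here is a sketch of the loop shape I mean. It is not copied from the benchmark: the function names are made up, and of the three constants only 0x1b03738712fad5c9 appears in this thread, so treat the other two as placeholders. It assumes the GCC/Clang `unsigned __int128` extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply, then xor the high and low halves of the 128-bit product. */
static inline uint64_t mulfold(uint64_t x, uint64_t k) {
    unsigned __int128 p = (unsigned __int128)x * k;
    return (uint64_t)(p >> 64) ^ (uint64_t)p;
}

static inline uint64_t next_random(uint64_t *state) {
    /* Cheap seed update: one add, independent of the multiplies below.
       The increment is a placeholder constant (assumption). */
    *state += 0x60bee2bee120fc15ULL;
    /* Long dependency chain: the second multiply waits on the first.
       The first multiplier is a placeholder constant (assumption). */
    uint64_t m1 = mulfold(*state, 0xa3b195354a39b70dULL);
    return mulfold(m1, 0x1b03738712fad5c9ULL);
}

/* Hot loop: the constants are loop-invariant, so a compiler can
   materialize them once in registers before the loop. */
static void fill_random(uint64_t *out, size_t n, uint64_t seed) {
    uint64_t state = seed;
    for (size_t i = 0; i < n; i++) {
        out[i] = next_random(&state);
    }
}
```

The only value carried from one iteration to the next is `state`, and that costs a single add. An out-of-order core can therefore overlap the multiply chains of several iterations, which makes the loop bound by multiplier throughput rather than by latency.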

I would guess that the Ampere box has a wretchedly slow multiply. In a comment on the post, Daniel finds an ugly performance corner on A57 (possibly related, possibly not): "On a Cortex A57 processor, to compute the most significant 64 bits of a 64-bit product, you must use the multiply-high instructions (umulh and smulh), but they require six cycles of latency and they prevent the execution of other multi-cycle instructions for an additional three cycles."