How did you convert time to cycles? Cycles are more about the instruction set.

Sure, put a "roughly" in there. It's a simple multiplication by clock speed. This is Justine who designed APE. I think they know.
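(To make that concrete with an assumed clock: on a ~3 GHz part, 10 µs of wall time is about 10×10^-6 s × 3×10^9 cycles/s ≈ 30,000 cycles.)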

Edit: also, no, cycles are not "more about the instruction set". They're pulses of electricity making their way through a chip; they're about as far from instruction-set-specific as it gets.

I've rounded the number and added a tilde since you're right. There's actually a lot of interesting depth behind this subject for x86 alone. RDTSC timestamps are guaranteed to be invariant across all models irrespective of clock speed (except K8), so a mapping exists between RDTSC-reported clock cycles and nanoseconds; however, that ratio usually isn't accessible on client-grade hardware and has to be measured haphazardly by the operating system. In my experience with CPUs from the last ten years it's usually in the ballpark of .323018 for both Intel and AMD. Multiply ticks by that and you get a pretty good nanosecond approximation. If you want a fixed-point expression for your benchmarks:

    static inline unsigned long ClocksToNanos(unsigned long x, unsigned long y) {
      // approximates round((y - x) * .323018), where .323018 is usually
      // the ratio between invariant rdtsc ticks and nanoseconds;
      // 338709 / 2**20 ≈ .323018, so an integer multiply-and-shift suffices
      unsigned long difference = y >= x ? y - x : ~x + y + 1;  // elapsed ticks, wraparound-safe
      return (difference * 338709) >> 20;
    }
That's just one quick and dirty way to compute the approximation. Should work great for any benchmark games. Obviously don't use it for, like, an X-ray or something. You can get the RDTSC value as follows:

    static inline unsigned long Rdtsc(void) {
      // rdtsc puts the low 32 bits of the timestamp counter in eax and the
      // high 32 bits in edx, so stitch the two halves back together
      unsigned long Rax, Rdx;
      asm volatile("rdtsc" : "=a"(Rax), "=d"(Rdx) : /* no inputs */ : "memory");
      return Rdx << 32 | Rax;
    }
Intel also has recommendations for things like using CPUID and memory fences if you need to rein in speculative execution, but they can be costly. If you want to go even deeper, RDTSCP has a nice feature that lets you know if the operating system switched you to a different core during your measurement interval, because it gives you TSC_AUX.
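
Here's a rough sketch of what reading RDTSCP along with TSC_AUX could look like, in the same inline-asm style as above; if the aux value differs between the start and end of a measurement, the OS moved you:

    static inline unsigned long Rdtscp(unsigned int *aux) {
      // rdtscp returns the timestamp in EDX:EAX and IA32_TSC_AUX in ECX,
      // which the kernel typically programs to identify the current core
      unsigned long Rax, Rdx;
      asm volatile("rdtscp" : "=a"(Rax), "=d"(Rdx), "=c"(*aux) : : "memory");
      return Rdx << 32 | Rax;
    }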

I'm not sure whether any of that timing precision will do any good on GitHub Actions, where this seems to be running. I don't think the machine is going to be quiet enough to get a good measurement of anything.

Noise and jitter in the VM should cancel out with repeated measurements.

Here's something interesting. If I run my test in a RHEL5 (Linux c. 2010) VirtualBox VM then it actually goes faster than a modern Linux kernel running on bare metal, taking only 14µs. If I run it in a RHEL7 VM then it needs 131µs.

So >1ms hello world execution should be a red flag. If that's actually the latency of GitHub's kernels and can't be explained by something like a shell script doing the time interval measurements, then it'd be interesting to learn what it's doing.

It appears helloworld is the only test with any repeats, and it only has 5 repeats. https://github.com/hanabi1224/Programming-Language-Benchmark... (Also, you can't average out the VM scheduling something else during the test when it's a very busy build machine that is constantly scheduling other work.)

Here's the measurement code; it appears to be significantly more complicated than a simple fork/exec/wait loop, but that could just be all the C# getting in the way: https://github.com/hanabi1224/Programming-Language-Benchmark... Note that we are definitely measuring the C# async runtime to some degree. Nevertheless, you are probably right that the bulk of this 1.8ms is in the executable under test, and it truly is just bloat. Running `hyperfine ./empty-main-function` from rustc on my Mac gives 0.8ms.
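
For reference, a bare fork/exec/wait measurement is only a handful of lines; this is a hypothetical sketch of that approach, not the benchmark's actual harness:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    // Times a single fork+execv+wait of the program named on the command
    // line and prints the wall-clock latency in microseconds. Run it a few
    // times and take the minimum to shave off some of the scheduler noise.
    int main(int argc, char *argv[]) {
      if (argc < 2) return 1;
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      pid_t pid = fork();
      if (!pid) {
        execv(argv[1], argv + 1);
        _exit(127);  // exec failed
      }
      waitpid(pid, 0, 0);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      long micros = (t1.tv_sec - t0.tv_sec) * 1000000 +
                    (t1.tv_nsec - t0.tv_nsec) / 1000;
      printf("%ld µs\n", micros);
      return 0;
    }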