The memory layout of your code has such a big impact on performance on modern machines that measuring performance without controlling for that variable leads to wild goose chases: you think you improved something, when in reality you just incidentally got the compiler to move the code around a bit.
Emery Berger has an excellent talk on this [1], and a causal profiler they developed called Coz [2].
Branch mispredicts might be somewhat invariant to that, but still, one of the main points of the talk is that people do too much eyeball statistics, mistaking the variance of the underlying stochastic process for actual signal.
Coz is pretty trivial to set up with Zig too.
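For anyone who hasn't tried it: the integration really is small. Here's a minimal sketch in C (the program and its names are invented; the coz.h header, the COZ_PROGRESS_NAMED macro, and the build/run commands follow the Coz README). You mark a point the program passes each time it completes a unit of useful work, and Coz reports which lines would actually increase that rate if sped up.

    // toy_server.c -- hypothetical program; only the coz.h parts are real API.
    #include <coz.h>   // ships with coz: provides the progress-point macros

    // Stand-in for whatever unit of work your program completes.
    static void handle_request(int i) {
        volatile long x = 0;
        for (long j = 0; j < 50000; j++) x += j * i;
    }

    int main(void) {
        // Coz needs the program to run for a while to do its virtual-speedup
        // experiments, so keep the workload non-trivial.
        for (int i = 0; i < 50000; i++) {
            handle_request(i);
            COZ_PROGRESS_NAMED("request_handled");  // throughput progress point
        }
        return 0;
    }

    // build: cc -g -O2 toy_server.c -o toy_server -ldl   (-g so samples map to source lines)
    // run:   coz run --- ./toy_server
    // results land in profile.coz, viewable at https://plasma-umass.org/coz/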
From what I know this is the closest project that fits your description.
Your problem might not even be CPU-bound: it could be contention, timing, overloaded queues, missing backpressure in the right places, I/O, or a bottleneck in work that is queued and executed elsewhere. Causal profiling is relevant precisely because ordinary profiling can miss the forest for the trees: https://github.com/plasma-umass/coz
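If the suspicion is that the bottleneck is a handoff or queued work rather than raw CPU, Coz's latency progress points map onto that directly: mark where a "transaction" enters and leaves the system, and the profiler estimates which lines, if sped up, would shrink that latency. A minimal sketch (the queue and workload are invented; COZ_BEGIN/COZ_END and the build/run commands follow the Coz README):

    // queue_latency.c -- hypothetical producer/consumer; only coz.h is real API.
    #include <coz.h>
    #include <pthread.h>

    #define N 100000

    static int items[N];
    static int head = 0, tail = 0;
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++) {
            pthread_mutex_lock(&mu);
            items[tail++] = i;
            COZ_BEGIN("queued_item");      // transaction starts when the item is enqueued...
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&mu);
        }
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++) {
            pthread_mutex_lock(&mu);
            while (head == tail) pthread_cond_wait(&cv, &mu);
            int item = items[head++];
            COZ_END("queued_item");        // ...and ends when it is picked up
            pthread_mutex_unlock(&mu);
            volatile long sink = 0;        // stand-in for downstream processing
            for (long j = 0; j < 2000; j++) sink += j * item;
        }
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

    // build: cc -g -O2 queue_latency.c -o queue_latency -lpthread -ldl
    // run:   coz run --- ./queue_latency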
It's really easy to write a benchmark which measures a different scenario from what your application actually hits. A classic example is benchmarking a hashmap in a tight loop when, in the real application, that hashmap is usually accessed cold.
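A contrived sketch of how large that gap can be, with a plain array standing in for the hashmap: the lookup code is identical in both runs; the only difference is whether the harness keeps the table hot in cache or trashes the caches between batches the way the surrounding application would.

    // bench_pitfall.c -- toy illustration; numbers will vary wildly by machine.
    #include <stdio.h>
    #include <time.h>

    #define TABLE_SIZE (1 << 16)            // ~256 KB of ints: fits comfortably in cache
    #define TRASH_SIZE (64 * 1024 * 1024)   // 64 MB: touching this evicts the table

    static int table[TABLE_SIZE];
    static char trash[TRASH_SIZE];

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static int lookup(unsigned key) { return table[key % TABLE_SIZE]; }

    static void touch_other_memory(void) {  // crude stand-in for "the rest of the app ran"
        for (long i = 0; i < TRASH_SIZE; i += 64) trash[i]++;
    }

    // Time batches of lookups; optionally evict the table before each batch.
    static double bench(int batches, int per_batch, int evict_first) {
        double total = 0;
        unsigned key = 1;
        volatile int sink = 0;
        for (int b = 0; b < batches; b++) {
            if (evict_first) touch_other_memory();
            double t0 = now();
            for (int i = 0; i < per_batch; i++) {
                key = key * 1664525u + 1013904223u;  // cheap LCG to scatter the keys
                sink += lookup(key);
            }
            total += now() - t0;
        }
        return total;
    }

    int main(void) {
        for (int i = 0; i < TABLE_SIZE; i++) table[i] = i;
        double hot  = bench(200, 256, 0);   // microbenchmark style: table stays hot
        double cold = bench(200, 256, 1);   // closer to the real app: table is cold each time
        printf("hot: %.4f s   cold: %.4f s  (same lookups, very different numbers)\n", hot, cold);
        return 0;
    }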
I definitely agree about directing efforts to where you can make an impact and guiding that through measurement, but benchmarks can miss that there's a problem and blame the wrong part of the application.
If the difference is large enough, ms vs hours, you'd have to really screw up methodology to get the wrong result (I've done it almost that badly before).
https://www.youtube.com/watch?v=r-TLSBdHe1A
And who knows, maybe they'd even be interested in using Coz afterwards.
There's a fundamental disconnect that makes it difficult for humans to reason about performance in computer programs. Because the speed of light is so slow, computer architecture as we know it will always rely on caches and out-of-order execution to be fast. The human brain does seem to work out of order, but it's only used to thinking about a world that runs in order. When we use theory of mind, we don't model other people's minds, we use our own as a model for theirs; see mirror neurons [1].
Because of this, standard code benchmarks are not very useful unless they can demonstrate order-of-magnitude speedups. Even something like a causal profiler [2], which attempts to control for the volatile aspects of performance, is of limited use; it cannot control for all variables, and its results will likely be invalidated by the same architectural variation it tries to control for. Instead (with respect to performance) we should focus on three factors:
- Code maintainability
- Algorithmic complexity
- Cache locality (see the sketch below)
Everything else is a distraction.
1. https://en.wikipedia.org/wiki/Mirror_neuron
2. https://www.youtube.com/watch?v=r-TLSBdHe1A
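To make the cache point in that list concrete, here is the textbook demonstration (a self-contained sketch, not tied to any real project): sum the same matrix twice, once along rows and once along columns. The arithmetic is identical; only the memory access pattern, and therefore the cache behaviour, changes, and that alone can change the runtime several-fold.

    // locality.c -- identical work, different traversal order.
    #include <stdio.h>
    #include <time.h>

    #define N 4096

    static float m[N][N];   // 64 MB: far larger than any cache

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = (float)(i + j);

        volatile float s1 = 0, s2 = 0;

        double t0 = now();
        for (int i = 0; i < N; i++)        // row-major: walks memory sequentially,
            for (int j = 0; j < N; j++)    // so every fetched cache line is fully used
                s1 += m[i][j];
        double rows = now() - t0;

        double t1 = now();
        for (int j = 0; j < N; j++)        // column-major: 16 KB stride per access,
            for (int i = 0; i < N; i++)    // roughly one cache miss per element
                s2 += m[i][j];
        double cols = now() - t1;

        printf("row-major: %.3f s   column-major: %.3f s\n", rows, cols);
        return 0;
    }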
You may think that's a cop-out, but consider something like Coz [1]. SQLite is managed and maintained by experts, and there's significant capital behind the engineering effort invested in it. Better tooling still managed to locate a 25% performance improvement there [2], and a 9% improvement in memcached. Even experts have their limits, and of course these tools require expertise of their own, so something like Coz is still an expert-only tool. The underlying concept will only see mass adoption once "expert speak" can be converted into something easily and simply communicated to people who aren't CPU or compiler experts, meeting users at their level of knowledge so they can dig in as deep as they need or want to.
[1] https://github.com/plasma-umass/coz [2] https://arxiv.org/abs/1608.03676
I learnt about this tool from Emery Berger's talk on it [2] (at Strange Loop), which I highly recommend. Lots of really nice insights, even outside of this tool.
[1] https://github.com/plasma-umass/coz [2] https://www.youtube.com/watch?v=r-TLSBdHe1A
What's interesting is that this technique correctly handles inter-thread effects like blocking, locking, and contention, so it can point out inter-thread issues that traditional profilers and flame graphs struggle with.
Summary: https://blog.acolyer.org/2015/10/14/coz-finding-code-that-co...
Video presentation: https://www.youtube.com/watch?v=jE0V-p1odPg&t=0m28s
Coz: https://github.com/plasma-umass/coz
JCoz (Java version): http://decave.github.io/JCoz/ and https://github.com/Decave/JCoz
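For readers wondering what such an inter-thread case looks like, here's a hypothetical sketch (the scenario is invented; COZ_PROGRESS is the real coz.h macro): a worker whose throughput is gated by a lock that a background thread holds for long stretches. A CPU flame graph mostly shows the two spin loops and says little about who is waiting on whom; a causal profiler, asked about the progress point, should instead indicate that shortening the background thread's critical section is what improves throughput.

    // contention.c -- only coz.h / COZ_PROGRESS are real API; the rest is made up.
    #include <coz.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static atomic_int done = 0;

    static void spin(long n) {             // stand-in for real work
        for (volatile long i = 0; i < n; i++) {}
    }

    static void *background(void *arg) {   // e.g. bookkeeping under the same lock
        (void)arg;
        while (!atomic_load(&done)) {
            pthread_mutex_lock(&lock);
            spin(1000000);                 // long critical section
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);
            spin(10000);                   // tiny critical section
            pthread_mutex_unlock(&lock);
            spin(200000);                  // heavy compute a flame graph will blame
            COZ_PROGRESS;                  // one unit of useful work finished
        }
        atomic_store(&done, 1);
        return NULL;
    }

    int main(void) {                       // runs for a while on purpose; coz needs runtime
        pthread_t b, w;
        pthread_create(&b, NULL, background, NULL);
        pthread_create(&w, NULL, worker, NULL);
        pthread_join(w, NULL);
        pthread_join(b, NULL);
        return 0;
    }

    // build: cc -g -O2 contention.c -o contention -lpthread -ldl
    // run:   coz run --- ./contention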