The memory layout of your code has such a big impact on performance on modern machines that measuring performance without controlling for that variable leads to wild goose chases: you think you improved something, when in reality you just incidentally got the compiler to move the code around a bit.
Emery Berger has an excellent talk on this [1], and a causal profiler they developed called Coz [2].
Branch mispredicts might be somewhat invariant to that, but still, one of the main points of the talk is that people do too much eyeball statistics, mistaking the variance of the underlying stochastic process for actual signal.
Coz is pretty trivial to set up with Zig too.
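For anyone who hasn't tried it: the integration really is small. Here's a minimal sketch in C (the program and its names are invented; the coz.h header, the COZ_PROGRESS_NAMED macro, and the build/run commands follow the Coz README). You mark a point the program passes each time it completes a unit of useful work, and Coz reports which lines would actually increase that rate if sped up.

    // toy_server.c -- hypothetical program; only the coz.h parts are real API.
    #include <coz.h>   // ships with coz: provides the progress-point macros

    // Stand-in for whatever unit of work your program completes.
    static void handle_request(int i) {
        volatile long x = 0;
        for (long j = 0; j < 50000; j++) x += j * i;
    }

    int main(void) {
        // Coz needs the program to run for a while to do its virtual-speedup
        // experiments, so keep the workload non-trivial.
        for (int i = 0; i < 50000; i++) {
            handle_request(i);
            COZ_PROGRESS_NAMED("request_handled");  // throughput progress point
        }
        return 0;
    }

    // build: cc -g -O2 toy_server.c -o toy_server -ldl   (-g so samples map to source lines)
    // run:   coz run --- ./toy_server
    // results land in profile.coz, viewable at https://plasma-umass.org/coz/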
From what I know this is the closest project that fits your description.
Your problem might not even be CPU-bound: it could be contention, timing, overloaded queues, missing backpressure in the right places, I/O, or a bottleneck in work that is queued and executed elsewhere. Causal profiling is relevant precisely because ordinary profiling can miss the forest for the trees: https://github.com/plasma-umass/coz
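If the suspicion is that the bottleneck is a handoff or queued work rather than raw CPU, Coz's latency progress points map onto that directly: mark where a "transaction" enters and leaves the system, and the profiler estimates which lines, if sped up, would shrink that latency. A minimal sketch (the queue and workload are invented; COZ_BEGIN/COZ_END and the build/run commands follow the Coz README):

    // queue_latency.c -- hypothetical producer/consumer; only coz.h is real API.
    #include <coz.h>
    #include <pthread.h>

    #define N 100000

    static int items[N];
    static int head = 0, tail = 0;
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++) {
            pthread_mutex_lock(&mu);
            items[tail++] = i;
            COZ_BEGIN("queued_item");      // transaction starts when the item is enqueued...
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&mu);
        }
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++) {
            pthread_mutex_lock(&mu);
            while (head == tail) pthread_cond_wait(&cv, &mu);
            int item = items[head++];
            COZ_END("queued_item");        // ...and ends when it is picked up
            pthread_mutex_unlock(&mu);
            volatile long sink = 0;        // stand-in for downstream processing
            for (long j = 0; j < 2000; j++) sink += j * item;
        }
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

    // build: cc -g -O2 queue_latency.c -o queue_latency -lpthread -ldl
    // run:   coz run --- ./queue_latency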
It's really easy to write a benchmark which measures a different scenario from what your application actually hits. A classic example is benchmarking a hashmap in a tight loop when, in the real application, that hashmap is usually accessed cold.
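A contrived sketch of how large that gap can be, with a plain array standing in for the hashmap: the lookup code is identical in both runs; the only difference is whether the harness keeps the table hot in cache or trashes the caches between batches the way the surrounding application would.

    // bench_pitfall.c -- toy illustration; numbers will vary wildly by machine.
    #include <stdio.h>
    #include <time.h>

    #define TABLE_SIZE (1 << 16)            // ~256 KB of ints: fits comfortably in cache
    #define TRASH_SIZE (64 * 1024 * 1024)   // 64 MB: touching this evicts the table

    static int table[TABLE_SIZE];
    static char trash[TRASH_SIZE];

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static int lookup(unsigned key) { return table[key % TABLE_SIZE]; }

    static void touch_other_memory(void) {  // crude stand-in for "the rest of the app ran"
        for (long i = 0; i < TRASH_SIZE; i += 64) trash[i]++;
    }

    // Time batches of lookups; optionally evict the table before each batch.
    static double bench(int batches, int per_batch, int evict_first) {
        double total = 0;
        unsigned key = 1;
        volatile int sink = 0;
        for (int b = 0; b < batches; b++) {
            if (evict_first) touch_other_memory();
            double t0 = now();
            for (int i = 0; i < per_batch; i++) {
                key = key * 1664525u + 1013904223u;  // cheap LCG to scatter the keys
                sink += lookup(key);
            }
            total += now() - t0;
        }
        return total;
    }

    int main(void) {
        for (int i = 0; i < TABLE_SIZE; i++) table[i] = i;
        double hot  = bench(200, 256, 0);   // microbenchmark style: table stays hot
        double cold = bench(200, 256, 1);   // closer to the real app: table is cold each time
        printf("hot: %.4f s   cold: %.4f s  (same lookups, very different numbers)\n", hot, cold);
        return 0;
    }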
I definitely agree about directing efforts to where you can make an impact and guiding that through measurement, but benchmarks can miss that there's a problem and blame the wrong part of the application.
If the difference is large enough, ms vs hours, you'd have to really screw up methodology to get the wrong result (I've done it almost that badly before).
https://www.youtube.com/watch?v=r-TLSBdHe1A
And who knows, maybe they'd even be interested in using Coz afterwards.
There's a fundamental disconnect that makes it difficult for humans to reason about performance in computer programs. Because the speed of light is so slow, computer architecture as we know it will always rely on caches and out-of-order execution to be fast. The human brain does seem to work out of order, but it's only used to thinking about a world that runs in order. When we use theory of mind, we don't model other people's minds, we use our own as a model for theirs; see mirror neurons [1].
Because of this, standard code benchmarks are not very useful unless they can demonstrate order-of-magnitude speedups. Even something like a causal profiler [2], which attempts to control for the volatile aspects of performance, is of limited use; it cannot control for all variables, and its results will likely be invalidated by the same architectural variation it tries to control for. Instead (with respect to performance) we should focus on three factors:
- Code maintainability
- Algorithmic complexity
- Cache locality (see the sketch below)
Everything else is a distraction.
1. https://en.wikipedia.org/wiki/Mirror_neuron
2. https://www.youtube.com/watch?v=r-TLSBdHe1A
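To make the cache point in that list concrete, here is the textbook demonstration (a self-contained sketch, not tied to any real project): sum the same matrix twice, once along rows and once along columns. The arithmetic is identical; only the memory access pattern, and therefore the cache behaviour, changes, and that alone can change the runtime several-fold.

    // locality.c -- identical work, different traversal order.
    #include <stdio.h>
    #include <time.h>

    #define N 4096

    static float m[N][N];   // 64 MB: far larger than any cache

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = (float)(i + j);

        volatile float s1 = 0, s2 = 0;

        double t0 = now();
        for (int i = 0; i < N; i++)        // row-major: walks memory sequentially,
            for (int j = 0; j < N; j++)    // so every fetched cache line is fully used
                s1 += m[i][j];
        double rows = now() - t0;

        double t1 = now();
        for (int j = 0; j < N; j++)        // column-major: 16 KB stride per access,
            for (int i = 0; i < N; i++)    // roughly one cache miss per element
                s2 += m[i][j];
        double cols = now() - t1;

        printf("row-major: %.3f s   column-major: %.3f s\n", rows, cols);
        return 0;
    }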
You may think that's a cop-out, but consider something like Coz [1]. SQLite is managed and maintained by experts, and there's significant capital behind the engineering effort invested in it. Better tooling still managed to locate a 25% performance improvement there [2], and a 9% improvement in memcached. Even experts have their limits, and of course these tools require expertise of their own, so something like Coz is still an expert-only tool. The underlying concept will only see mass adoption once "expert speak" can be converted into something easily and simply communicated to people who aren't CPU or compiler experts, meeting users at their level of knowledge so they can dig in as deep as they need or want to.
[1] https://github.com/plasma-umass/coz [2] https://arxiv.org/abs/1608.03676
I learnt about this tool from Emery Berger's talk on it [2] (at Strange Loop), which I highly recommend. Lots of really nice insights, even outside of this tool.
[1] https://github.com/plasma-umass/coz [2] https://www.youtube.com/watch?v=r-TLSBdHe1A
What's interesting is that this technique correctly handles inter-thread effects like blocking, locking, and contention, so it can point out inter-thread issues that traditional profilers and flame graphs struggle with.
Summary: https://blog.acolyer.org/2015/10/14/coz-finding-code-that-co...
Video presentation: https://www.youtube.com/watch?v=jE0V-p1odPg&t=0m28s
Coz: https://github.com/plasma-umass/coz
JCoz (Java version): http://decave.github.io/JCoz/ and https://github.com/Decave/JCoz
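For readers wondering what such an inter-thread case looks like, here's a hypothetical sketch (the scenario is invented; COZ_PROGRESS is the real coz.h macro): a worker whose throughput is gated by a lock that a background thread holds for long stretches. A CPU flame graph mostly shows the two spin loops and says little about who is waiting on whom; a causal profiler, asked about the progress point, should instead indicate that shortening the background thread's critical section is what improves throughput.

    // contention.c -- only coz.h / COZ_PROGRESS are real API; the rest is made up.
    #include <coz.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static atomic_int done = 0;

    static void spin(long n) {             // stand-in for real work
        for (volatile long i = 0; i < n; i++) {}
    }

    static void *background(void *arg) {   // e.g. bookkeeping under the same lock
        (void)arg;
        while (!atomic_load(&done)) {
            pthread_mutex_lock(&lock);
            spin(1000000);                 // long critical section
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);
            spin(10000);                   // tiny critical section
            pthread_mutex_unlock(&lock);
            spin(200000);                  // heavy compute a flame graph will blame
            COZ_PROGRESS;                  // one unit of useful work finished
        }
        atomic_store(&done, 1);
        return NULL;
    }

    int main(void) {                       // runs for a while on purpose; coz needs runtime
        pthread_t b, w;
        pthread_create(&b, NULL, background, NULL);
        pthread_create(&w, NULL, worker, NULL);
        pthread_join(w, NULL);
        pthread_join(b, NULL);
        return 0;
    }

    // build: cc -g -O2 contention.c -o contention -lpthread -ldl
    // run:   coz run --- ./contention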