Nice write-up! Using BPF to trace malloc/free is a good example of the tool’s power. Unfortunately, IME, this approach doesn’t scale to very high-load services. Once you’re calling malloc/free hundreds of thousands of times a second, the overhead of jumping into the kernel every time cripples performance.

It would be great if one could configure the uprobes for malloc/free to fire only once every N calls, but when I last looked they were unconditional. Having the BPF program return early didn’t help either, because the cost is in entering the kernel in the first place.

However, jemalloc itself has great support for producing heap profiles with low overhead. Allocations are sampled and the stacks leading to them are recorded in much the same way as the linked BPF approach:

https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Heap-Pro...
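To make the sampling idea concrete, here is a toy sketch in Python of byte-based allocation sampling. It is not jemalloc's actual implementation. As far as I understand it, jemalloc's sampling interval is expressed in bytes of allocation activity (the `lg_prof_sample` option), and the sketch only mimics that general idea. The `SamplingProfiler` class, `on_alloc`, and `mean_bytes` are hypothetical names made up for illustration.

```python
import random
import traceback
from collections import Counter


class SamplingProfiler:
    """Toy allocation sampler (hypothetical, for illustration only).

    Records roughly one stack trace per `mean_bytes` bytes allocated.
    """

    def __init__(self, mean_bytes=512 * 1024, seed=None):
        self.mean_bytes = mean_bytes
        self.rng = random.Random(seed)
        self.samples = Counter()  # call stack -> number of sampled allocations
        self.bytes_until_sample = self._next_gap()

    def _next_gap(self):
        # Exponentially distributed gap with mean `mean_bytes`, so that
        # periodic allocation patterns do not line up with the sampler.
        return self.rng.expovariate(1.0 / self.mean_bytes)

    def on_alloc(self, size):
        """Call on every allocation. Returns True if this one was sampled."""
        # Fast path: one subtraction and one comparison, all in user space.
        self.bytes_until_sample -= size
        if self.bytes_until_sample > 0:
            return False
        # Slow path: capture the caller's stack (excluding this frame).
        frames = traceback.extract_stack()[:-1]
        stack = tuple(f"{f.name}:{f.lineno}" for f in frames)
        self.samples[stack] += 1
        self.bytes_until_sample = self._next_gap()
        return True


if __name__ == "__main__":
    profiler = SamplingProfiler(mean_bytes=1024, seed=0)
    sampled = sum(profiler.on_alloc(64) for _ in range(10_000))
    print(f"sampled {sampled} of 10000 allocations")
```

The point is that the common case never leaves user space and costs a subtraction and a branch. Only the rare sampled allocation pays for a stack walk, which is why this style of profiling stays cheap at very high allocation rates.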

> Once you’re calling malloc/free hundreds of thousands of times a second the overhead of jumping into the kernel every time cripples performance.

Shameless plug, in case you (or anyone else) are interested. I wrote a memory profiler for exactly this use case:

https://github.com/koute/bytehound

It's definitely not perfect, but it's relatively fast, has an okay-ish GUI, and it's even scriptable: https://koute.github.io/bytehound/memory_leak_analysis.html