This article would've been a bit cooler if the conclusion wasn't "switch from the default allocator to jemalloc" but instead "use jemalloc to prove something is wrong in the default allocator, then track down and fix the underlying problem".
Unless I misunderstood, and the default Rust allocator, with large request bodies and high concurrency, is always going to suffer unfixable heap fragmentation like that displayed in the article?
I agree that there could have been a more satisfying conclusion, but it's worth noting that jemalloc isn't a panacea. I've seen issues similar to Svix's in both Rust and C++ applications that were heavy on ephemeral allocations, and have fixed them by doing each of the following at one point or another, depending on the specific process:
* Switching from libc malloc to jemalloc
* Switching from libc malloc to tcmalloc (dating myself a little bit)
* Switching from libc malloc to mimalloc
* Switching from jemalloc to mimalloc
* Switching from jemalloc to libc malloc
* Switching from mimalloc to jemalloc
Possibly others; I only want to list cases I'm 100% certain of.

Heap fragmentation is just a reality of some allocation patterns without a GC runtime.
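For reference, in Rust each of those switches is usually just a one-line global-allocator swap, which is part of why trying several is cheap. A minimal sketch using the tikv-jemallocator crate (crate version approximate; mimalloc is analogous via mimalloc::MiMalloc):

```rust
// Cargo.toml (assumed): tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Every heap allocation in the program (Box, Vec, String, ...) now goes
// through jemalloc instead of the platform default (glibc malloc on most
// Linux targets).
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    let buf = vec![0u8; 1 << 20];
    println!("allocated {} bytes via jemalloc", buf.len());
}
```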
One certainly can (and, in some cases, should) make one's application more allocator-friendly, but - aside from some often low-hanging fruit - this is a time-intensive process involving a bit of, for lack of a better word, arcane knowledge. (I should inline all my fields and allocate on the stack as much as possible, right? Yes, well, except ...)
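The low-hanging fruit is usually just not re-allocating in hot paths. A tiny Rust sketch of the kind of change I mean (the function and workload are made up; only the buffer-reuse pattern is the point):

```rust
// Hypothetical hot path; names and data are invented for illustration.
fn sum_fields(lines: &[&str]) -> u64 {
    let mut total: u64 = 0;

    // Allocate one scratch buffer up front and reuse it, instead of building
    // (and dropping) a fresh Vec on every iteration.
    let mut scratch: Vec<u64> = Vec::with_capacity(64);
    for line in lines {
        scratch.clear(); // keeps the capacity; no allocator traffic
        scratch.extend(line.split(',').filter_map(|f| f.trim().parse::<u64>().ok()));
        total += scratch.iter().sum::<u64>();
    }
    total
}

fn main() {
    println!("{}", sum_fields(&["1, 2, 3", "40, 50"])); // 96
}
```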
If you already have a halfway decent benchmark suite or workload generator, which you'll want for other purposes anyway, it's often a lot quicker to just try a few other allocators and select the one that handles your workload best.
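Concretely, "just try a few other allocators" for me tends to look like a small driver that reproduces the allocation pattern and reports resident memory, rebuilt once per #[global_allocator]. A rough Rust sketch (Linux-only RSS reading via /proc; the workload here is made up):

```rust
use std::fs;

// Made-up fragmentation-prone workload: lots of short-lived, mixed-size
// allocations, with a small fraction kept alive to pin pages.
fn churn(rounds: usize) -> Vec<Vec<u8>> {
    let mut long_lived = Vec::new();
    for i in 0..rounds {
        let mut ephemeral: Vec<Vec<u8>> = (0..1_000)
            .map(|j| vec![0u8; 64 + (i * 31 + j * 17) % 8_192])
            .collect();
        long_lived.push(ephemeral.swap_remove(0)); // ~0.1% survives
    }
    long_lived
}

// Resident set size in kB, scraped from /proc (Linux only).
fn rss_kb() -> u64 {
    fs::read_to_string("/proc/self/status")
        .ok()
        .and_then(|s| {
            s.lines()
                .find(|l| l.starts_with("VmRSS:"))
                .and_then(|l| l.split_whitespace().nth(1))
                .and_then(|v| v.parse().ok())
        })
        .unwrap_or(0)
}

fn main() {
    let pinned = churn(5_000);
    println!("{} live allocations, VmRSS = {} kB", pinned.len(), rss_kb());
    // Rebuild with jemalloc / mimalloc / tcmalloc as the #[global_allocator]
    // and compare the RSS numbers for the same workload.
}
```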
If you think of tcmalloc as an old crusty allocator, you've probably only seen the gperftools version of it.
This is the version Google now uses internally: https://github.com/google/tcmalloc
It's worth a fresh look. In particular, it supports per-CPU caches as an alternative to per-thread caches, which are fantastic if you have a lot more threads than CPUs. I haven't checked whether it's been adapted to the latest upstream kernel API, but there's also the idea of "vcpu"-based caches: rather than keying on a physical CPU id, the cache is keyed on a dense (optionally per-NUMA-node) id assigned to active threads, so it still works well when this process has a small CPU allocation on a many-core machine.
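To make the per-CPU vs. per-thread distinction concrete, here's a toy Rust sketch of the indexing idea only. This is emphatically not how tcmalloc implements it (the real thing uses restartable sequences rather than locks, and handles migration races); it just shows why the cache footprint scales with the CPU count instead of the thread count. Assumes Linux and the libc crate for sched_getcpu.

```rust
use std::sync::Mutex;

/// Toy per-CPU free list: one shard per CPU, so cached memory scales with
/// the CPU count, not with the (possibly much larger) number of threads.
struct PerCpuCache {
    shards: Vec<Mutex<Vec<Box<[u8; 4096]>>>>,
}

impl PerCpuCache {
    fn new(cpus: usize) -> Self {
        Self { shards: (0..cpus).map(|_| Mutex::new(Vec::new())).collect() }
    }

    fn shard(&self) -> &Mutex<Vec<Box<[u8; 4096]>>> {
        // The current CPU id picks the shard; a thread that migrates simply
        // lands in another shard. (tcmalloc uses rseq here, not a lock.)
        let cpu = unsafe { libc::sched_getcpu() }.max(0) as usize;
        &self.shards[cpu % self.shards.len()]
    }

    fn alloc(&self) -> Box<[u8; 4096]> {
        self.shard().lock().unwrap().pop().unwrap_or_else(|| Box::new([0u8; 4096]))
    }

    fn free(&self, block: Box<[u8; 4096]>) {
        self.shard().lock().unwrap().push(block);
    }
}

fn main() {
    let cpus = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let cache = PerCpuCache::new(cpus);
    let block = cache.alloc();
    cache.free(block); // returns to whichever CPU's shard we're on now
}
```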