This article would've been a bit cooler if the conclusion wasn't "switch from the default allocator to jemalloc" but instead "use jemalloc to prove something is wrong in the default allocator, then track down and fix the underlying problem".
Unless I misunderstood, and the default Rust allocator, with large request bodies and high concurrency, is always going to suffer unfixable heap fragmentation like that displayed in the article?
I agree that there could have been a more satisfying conclusion, but it's worth noting that jemalloc isn't a panacea. I've seen issues similar to Svix's in both Rust and C++ applications that were heavy on ephemeral allocations, and have fixed them by doing each of the following at one point or another, depending on the specific process:
* Switching from libc malloc to jemalloc
* Switching from libc malloc to tcmalloc (dating myself a little bit)
* Switching from libc malloc to mimalloc
* Switching from jemalloc to mimalloc
* Switching from jemalloc to libc malloc
* Switching from mimalloc to jemalloc
Possibly others; I only want to list cases I'm 100% certain of.

Heap fragmentation is just a reality of some allocation patterns without a GC runtime.
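For reference, in Rust each of those switches is usually just a one-line global-allocator swap, which is part of why trying several is cheap. A minimal sketch using the tikv-jemallocator crate (crate version approximate; mimalloc is analogous via mimalloc::MiMalloc):

```rust
// Cargo.toml (assumed): tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Every heap allocation in the program (Box, Vec, String, ...) now goes
// through jemalloc instead of the platform default (glibc malloc on most
// Linux targets).
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    let buf = vec![0u8; 1 << 20];
    println!("allocated {} bytes via jemalloc", buf.len());
}
```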
One certainly can (and, in some cases, should) make one's application more allocator-friendly, but - aside from some often low-hanging fruit - this is a time-intensive process involving a bit of, for lack of a better word, arcane knowledge. (I should inline all my fields and allocate on the stack as much as possible, right? Yes, well, except ...)
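The low-hanging fruit is usually just not re-allocating in hot paths. A tiny Rust sketch of the kind of change I mean (the function and workload are made up; only the buffer-reuse pattern is the point):

```rust
// Hypothetical hot path; names and data are invented for illustration.
fn sum_fields(lines: &[&str]) -> u64 {
    let mut total: u64 = 0;

    // Allocate one scratch buffer up front and reuse it, instead of building
    // (and dropping) a fresh Vec on every iteration.
    let mut scratch: Vec<u64> = Vec::with_capacity(64);
    for line in lines {
        scratch.clear(); // keeps the capacity; no allocator traffic
        scratch.extend(line.split(',').filter_map(|f| f.trim().parse::<u64>().ok()));
        total += scratch.iter().sum::<u64>();
    }
    total
}

fn main() {
    println!("{}", sum_fields(&["1, 2, 3", "40, 50"])); // 96
}
```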
If you already have a halfway decent benchmark suite or workload generator, which you'll want for other purposes anyway, it's often a lot quicker to just try a few other allocators and select the one that handles your workload best.
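Concretely, "just try a few other allocators" for me tends to look like a small driver that reproduces the allocation pattern and reports resident memory, rebuilt once per #[global_allocator]. A rough Rust sketch (Linux-only RSS reading via /proc; the workload here is made up):

```rust
use std::fs;

// Made-up fragmentation-prone workload: lots of short-lived, mixed-size
// allocations, with a small fraction kept alive to pin pages.
fn churn(rounds: usize) -> Vec<Vec<u8>> {
    let mut long_lived = Vec::new();
    for i in 0..rounds {
        let mut ephemeral: Vec<Vec<u8>> = (0..1_000)
            .map(|j| vec![0u8; 64 + (i * 31 + j * 17) % 8_192])
            .collect();
        long_lived.push(ephemeral.swap_remove(0)); // ~0.1% survives
    }
    long_lived
}

// Resident set size in kB, scraped from /proc (Linux only).
fn rss_kb() -> u64 {
    fs::read_to_string("/proc/self/status")
        .ok()
        .and_then(|s| {
            s.lines()
                .find(|l| l.starts_with("VmRSS:"))
                .and_then(|l| l.split_whitespace().nth(1))
                .and_then(|v| v.parse().ok())
        })
        .unwrap_or(0)
}

fn main() {
    let pinned = churn(5_000);
    println!("{} live allocations, VmRSS = {} kB", pinned.len(), rss_kb());
    // Rebuild with jemalloc / mimalloc / tcmalloc as the #[global_allocator]
    // and compare the RSS numbers for the same workload.
}
```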
If you think of tcmalloc as an old crusty allocator, you've probably only seen the gperftools version of it.
This is the version Google now uses internally: https://github.com/google/tcmalloc
It's worth a fresh look. In particular, it supports per-CPU caches as an alternative to per-thread caches, which are fantastic if you have a lot more threads than CPUs. I haven't checked whether it's been adapted to the latest upstream kernel API, but there's also the idea of "vcpu"-based caches: rather than keying on a physical CPU id, the cache is keyed on a dense (optionally per-NUMA-node) id assigned to active threads, so it still works well when this process has a small CPU allocation on a many-core machine.
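To make the per-CPU vs. per-thread distinction concrete, here's a toy Rust sketch of the indexing idea only. This is emphatically not how tcmalloc implements it (the real thing uses restartable sequences rather than locks, and handles migration races); it just shows why the cache footprint scales with the CPU count instead of the thread count. Assumes Linux and the libc crate for sched_getcpu.

```rust
use std::sync::Mutex;

/// Toy per-CPU free list: one shard per CPU, so cached memory scales with
/// the CPU count, not with the (possibly much larger) number of threads.
struct PerCpuCache {
    shards: Vec<Mutex<Vec<Box<[u8; 4096]>>>>,
}

impl PerCpuCache {
    fn new(cpus: usize) -> Self {
        Self { shards: (0..cpus).map(|_| Mutex::new(Vec::new())).collect() }
    }

    fn shard(&self) -> &Mutex<Vec<Box<[u8; 4096]>>> {
        // The current CPU id picks the shard; a thread that migrates simply
        // lands in another shard. (tcmalloc uses rseq here, not a lock.)
        let cpu = unsafe { libc::sched_getcpu() }.max(0) as usize;
        &self.shards[cpu % self.shards.len()]
    }

    fn alloc(&self) -> Box<[u8; 4096]> {
        self.shard().lock().unwrap().pop().unwrap_or_else(|| Box::new([0u8; 4096]))
    }

    fn free(&self, block: Box<[u8; 4096]>) {
        self.shard().lock().unwrap().push(block);
    }
}

fn main() {
    let cpus = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let cache = PerCpuCache::new(cpus);
    let block = cache.alloc();
    cache.free(block); // returns to whichever CPU's shard we're on now
}
```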