If you think of tcmalloc as an old crusty allocator, you've probably only seen the gperftools version of it.
This is the version Google now uses internally: https://github.com/google/tcmalloc
It's worth a fresh look. In particular, it supports per-CPU caches as an alternative to per-thread caches. Per-CPU caches are fantastic if you have a lot more threads than CPUs, since the number of caches scales with the CPU count rather than the thread count. There's also the idea of "vcpu"-based caches (I haven't checked whether it's been adapted for the latest upstream kernel API): rather than a physical CPU id, each active thread gets a dense id (optionally assigned per NUMA node), so it still works well if this process has a small CPU allocation on a many-core machine.
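To make the dense-id point concrete, here is a toy sketch in Python. It is not how tcmalloc or the kernel actually does it; `DenseIdPool` and `slots_needed` are hypothetical names made up for illustration. It only shows why indexing caches by a dense id keeps the cache table small when the process is pinned to a few high-numbered CPUs.

```python
import heapq


class DenseIdPool:
    """Toy model of 'vcpu' ids: always hand out the smallest free integer."""

    def __init__(self):
        self._free = []   # min-heap of ids that were released
        self._next = 0    # next id that has never been handed out

    def acquire(self):
        # Reuse the smallest released id if there is one, else mint a new one.
        if self._free:
            return heapq.heappop(self._free)
        new_id = self._next
        self._next += 1
        return new_id

    def release(self, ident):
        heapq.heappush(self._free, ident)


def slots_needed(ids):
    """Size of a cache table indexed directly by the given ids."""
    return max(ids) + 1 if ids else 0


def demo():
    # Assumption: the process is confined to three CPUs on a 64-core box.
    physical_ids = [61, 62, 63]
    pool = DenseIdPool()
    dense_ids = [pool.acquire() for _ in physical_ids]
    return slots_needed(physical_ids), slots_needed(dense_ids)
```

With the assumed CPU set, `demo()` reports a 64-slot table when indexing by physical id versus a 3-slot table when indexing by dense id. A released id is reused before any new one is minted, which is what keeps the ids dense as threads come and go.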
If you are in doubt, you should simply use what ClickHouse is using.
Yup [0].
There are even other complete libraries like tcmalloc [1] and jemalloc [2].
[0] https://stackoverflow.com/a/262481/1111557