Tangential question: Does anybody have a good recommendation for a profiler that works well with massively async codebases?

My experience has been that the concurrent nature of coroutines can make it hard to reason about what's going on at a particular point in time. If you don't know how many things you're awaiting at a specific moment (and what external resources they may be interacting with), it's not exactly easy to attribute memory usage to specific code paths.

I like the open-source Perfetto UI (formerly Google Chrome's about:tracing). I wrote my own timing-trace JSON export for Python and C++, across threads and processes. The docs have some pointers on creating your own traces with the Tracing SDK.

See https://ui.perfetto.dev

Looks like you can use magic-trace with a Perfetto-based UI: https://github.com/janestreet/magic-trace