Tangential question: Does anybody have a good recommendation for a profiler that works well with massively async codebases?
My experience has been that the concurrent nature of coroutines can make it hard to reason about what's going on at a particular point in time. If you don't know how many things you're awaiting at a specific moment (and what external resources they may be interacting with), it's not exactly easy to attribute memory usage to specific codepaths.
I like the open-source Perfetto UI (the successor to Chrome's about:tracing / chrome://tracing viewer). I wrote my own timing-trace JSON export for Python and C++ that works across threads and processes. The docs have some pointers on creating your own traces with the Tracing SDK.
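
For anyone curious what a hand-rolled export can look like, here is a rough sketch in Python, not the exporter described above. It assumes the Chrome Trace Event JSON format (a "traceEvents" list of "complete" events with ph "X" and microsecond timestamps), which the Perfetto UI can open. The names trace_span and dump_trace are made up for illustration. It also assumes time.perf_counter_ns is comparable across the processes you care about, which depends on the platform, so treat cross-process alignment as unverified.

```python
import json
import os
import threading
import time
from contextlib import contextmanager

# Hypothetical helper names; this is an illustrative sketch only.
_events = []
_lock = threading.Lock()


@contextmanager
def trace_span(name, category="app"):
    """Record one 'complete' (ph="X") event covering the with-block."""
    start_ns = time.perf_counter_ns()
    try:
        yield
    finally:
        end_ns = time.perf_counter_ns()
        event = {
            "name": name,
            "cat": category,
            "ph": "X",                       # complete event: start + duration
            "ts": start_ns / 1000,           # microseconds
            "dur": (end_ns - start_ns) / 1000,
            "pid": os.getpid(),
            "tid": threading.get_ident(),
        }
        with _lock:
            _events.append(event)


def dump_trace(path):
    """Write all recorded events as a Chrome Trace Event JSON file."""
    with _lock:
        payload = {"traceEvents": list(_events)}
    with open(path, "w") as f:
        json.dump(payload, f)
```

Usage is just wrapping the interesting regions in `with trace_span("fetch_user"):` and calling `dump_trace("trace.json")` at exit, then loading the file in the Perfetto UI. For coroutine-heavy code, every task on an event loop shares one thread id, so you would probably want to put a task identifier in place of "tid" (or use async begin/end events) to keep interleaved coroutines on separate tracks.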
Looks like you can use magic-trace with a Perfetto-based UI: https://github.com/janestreet/magic-trace