"My first attempt at a benchmark involved allocating and freeing blocks of random size. Twitter friends correctly scolded me and said that's not good enough. I need real data with real allocation patterns and sizes."
"The goal is to create a "journal" of memory operations. It should record malloc and free operations with their inputs and outputs. Then the journal can be replayed with different allocators to compare performance."
I don't think that is sufficient either. You need to interleave the malloc/free calls with the rest of the work, because that other work knocks your allocator's data structures out of cache, pollutes your TLB and branch predictor, and loads up the DRAM interface (e.g. the GPU fetching data from it), etc. etc.
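i.e. if you replay the journal at all, put something between the calls. A crude sketch, reusing the JournalEntry above (it's still nothing like the GPU hammering DRAM, but it's closer to reality than back-to-back malloc/free):

    #include <cstdint>
    #include <cstdlib>
    #include <unordered_map>
    #include <vector>

    // Between every replayed op, thrash the caches/TLB so the allocator's
    // metadata doesn't stay artificially hot, the way it never would mid-frame.
    static void pollute_caches(std::vector<char> &scratch) {
        for (size_t i = 0; i < scratch.size(); i += 64)   // one touch per cache line
            scratch[i]++;
    }

    void replay(const std::vector<JournalEntry> &journal) {
        std::vector<char> scratch(64u * 1024 * 1024);      // comfortably larger than the LLC
        std::unordered_map<uintptr_t, void *> live;        // recorded ptr -> replayed ptr
        for (const JournalEntry &e : journal) {
            if (e.op == 0) {
                live[e.ptr] = malloc(e.size);
            } else {
                free(live[e.ptr]);
                live.erase(e.ptr);
            }
            pollute_caches(scratch);                       // the "other work"
        }
    }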
I would have measured the time directly inside of the Doom allocator and the OP could produce the same charts directly from the measured data without replaying the allocation log.
And IF I were going to collect that data, I'd grab the call stack as well to see where the allocations were coming from. There might be a chance to retrofit an arena allocator, but I would think Carmack's code is probably already close to optimal? Yes it is, which is why we can even have this discussion about it.
The other change I would make to the experiment is to take the traces from a timedemo rather than regular gameplay, so that hypotheses can be tested reproducibly. At the very least, one could precisely time the impact of the instrumentation.
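Roughly like this, assuming you hook the existing allocator entry point in idlib's Heap.cpp (name and signature from memory, so treat this as a sketch):

    #include <x86intrin.h>   // __rdtsc
    #include <cstdint>

    // Time every call in place, inside the game, so the measurement includes
    // whatever cache/TLB/branch-predictor state the rest of the frame leaves behind.
    extern void *Mem_Alloc(const int size);   // Doom 3's allocator entry point (from memory)

    static uint64_t g_cycles, g_calls;

    void *Mem_Alloc_Timed(const int size) {
        uint64_t t0 = __rdtsc();
        void *p = Mem_Alloc(size);
        g_cycles += __rdtsc() - t0;           // dump g_cycles / g_calls (or a histogram) at shutdown
        g_calls++;
        return p;
    }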
> I would have measured the time directly inside of the Doom allocator
Exactly! The author already has all the infra necessary (rdtsc for each call). Measuring time inside the game would be more accurate and simpler. Why did the author do things this way? I must be missing something.
(By the way, the author seems to be aware of this issue, since they added code that simulates using the allocated blocks (touching one byte for every allocated 4 KiB), but that doesn't feel like nearly enough.)
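For reference, "touching one byte for every allocated 4 KiB" is presumably something like the loop below: it makes sure the pages get mapped, but it is nowhere near the cache pressure of the game actually using the memory.

    #include <cstddef>

    // Touch one byte in every 4 KiB page of a freshly allocated block.
    static void touch_pages(void *p, std::size_t size) {
        volatile char *c = static_cast<volatile char *>(p);
        for (std::size_t off = 0; off < size; off += 4096)
            c[off] = 1;
    }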
> grab the call stack as well
Maybe this part could be done without any code changes, using VTune or Linux perf? Sure, those only gather sampled measurements (so not ideal for the original latency measurement), but to get a rough idea of where the costly allocations come from, it could be an easy way.
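If you did want it in code instead, glibc's backtrace() is enough for a rough histogram of allocation sites. A minimal sketch (symbolization would happen offline, e.g. with addr2line or backtrace_symbols()):

    #include <execinfo.h>   // backtrace(), glibc
    #include <cstddef>
    #include <cstdio>

    // Grab up to 16 return addresses at each allocation and log them raw.
    static void log_alloc_callstack(std::size_t size) {
        void *frames[16];
        int n = backtrace(frames, 16);
        std::fprintf(stderr, "alloc %zu bytes:", size);
        for (int i = 0; i < n; ++i)
            std::fprintf(stderr, " %p", frames[i]);
        std::fprintf(stderr, "\n");
    }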
> Why did the author do things this way? I must be missing something.
Likely because it's easier to test different allocators with a small replay tool than it is to get Doom 3 to compile and link against dlmalloc, jemalloc, mimalloc, rpmalloc, and TLSF.
I would also bet that getting the game to perform the exact same series of allocations would be an intractable problem to solve. I don't think Doom 3 has a benchmark mode; the author just recorded themselves loading the game, loading a level, doing a bit of gameplay, etc.
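Right: the replay tool only has to abstract a single malloc/free pair, something like the sketch below; swapping the allocator under test is a recompile of a tiny program rather than of the whole game (exact entry-point names and init requirements differ per allocator, so take it as illustrative).

    #include <cstddef>
    #include <cstdlib>

    // Point these at the allocator under test: mi_malloc/mi_free for mimalloc,
    // rpmalloc/rpfree for rpmalloc, dlmalloc/dlfree for dlmalloc, and so on.
    static void *(*bench_malloc)(std::size_t) = std::malloc;
    static void  (*bench_free)(void *)        = std::free;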
timedemo uscache
The DOOM3 source is here https://github.com/id-Software/DOOM-3
There are lots of facilities for recording and playing back demos. The author could record their own gameplay and play it back in real time, versus running timeDemo (which plays back as fast as possible).
https://github.com/id-Software/DOOM-3/blob/master/neo/framew...
2865: cmdSystem->AddCommand( "recordDemo", "records a demo" );
2866: cmdSystem->AddCommand( "stopRecording", "stops demo recording" );
2867: cmdSystem->AddCommand( "playDemo", "plays back a demo",
2868: cmdSystem->AddCommand( "timeDemo", "times a demo",
2869: cmdSystem->AddCommand( "timeDemoQuit", "times a demo and quits",
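So the workflow would be roughly this in the console (demo name made up; I believe these commands take the demo name as their argument):

    recordDemo benchrun
    < play through the section you want to measure >
    stopRecording
    timeDemo benchrun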
https://www.youtube.com/watch?v=CLA42q3myCg