I wrote a little tool that used perf profiles from our production fleet to generate a custom linker script, which reordered our main server program’s binary to be significantly more cache-friendly. The heuristic I came up with for the reordering was one of the few (maybe the only) genuine “eureka” moments I’ve had in my career.

And the performance win was extremely nice :-)
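
As a rough illustration of the general idea (and not the heuristic described above, which isn't spelled out here), this is a minimal Python sketch that sorts functions by sample count and emits a GNU ld style `.text` fragment with the hottest functions first. It assumes the binary is built with `-ffunction-sections`, so each function lives in its own `.text.<symbol>` input section, and that you have already reduced the profile to one symbol name per sample. The function names and the toy input are hypothetical.

```python
from collections import Counter


def count_samples(sampled_symbols):
    """Count how many profile samples landed in each function.

    `sampled_symbols` is an iterable with one symbol name per sample;
    extracting those names from real perf output is out of scope here.
    """
    return Counter(sampled_symbols)


def hot_first_order(sample_counts):
    """Order symbols hottest first, breaking ties by name for stable output.

    This is a deliberately naive ordering used as a stand-in; a real tool
    would likely also look at which functions call each other.
    """
    ranked = sorted(sample_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked]


def text_section_fragment(ordered_symbols):
    """Render a .text output section that places the given functions first.

    Assumes one input section per function, named .text.<symbol>.
    """
    lines = ["  .text :", "  {"]
    lines += [f"    *(.text.{name})" for name in ordered_symbols]
    # Catch-all for every function that never showed up in the profile.
    lines += ["    *(.text .text.*)", "  }"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy data standing in for symbols pulled from a profile.
    samples = ["parse", "handle", "parse", "log", "parse", "handle"]
    print(text_section_fragment(hot_first_order(count_samples(samples))))
```

The fragment would still need to be merged into a complete linker script for the target, and packing hot code together is only the crudest version of the trick.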

For anyone curious about trying something like this technique, Facebook has a similar tool, BOLT:

https://github.com/facebookincubator/BOLT