Any suggestions/resources on learning how to implement these kinds of optimizations?
Agner Fog's guides and the Intel reference documents are the best resources that I've found:
http://www.agner.org/optimize/
http://www.intel.com/content/www/us/en/processors/architectu...
For tooling, IACA (static throughput analysis of a code block) plus perf, pmu-tools, and likwid (hardware performance counters) are very useful.
https://software.intel.com/en-us/articles/intel-architecture...
https://perf.wiki.kernel.org/index.php/Main_Page