They also have two tools which basically implement this analysis and spit out a bunch of very useful metrics that are actionable and very easy to understand:
- Intel VTune is a fantastic tool to start with. It's currently free to use, supports most OSes, and is very beginner-friendly.
- Intel pmu-tools (https://github.com/andikleen/pmu-tools) is basically a command-line version of VTune.
It uses CPU performance counters to show things like ITLB_Misses or MEM_Bandwidth. It won't show when you're waiting for GPU/SSD/etc because those aren't visible from CPU performance counters. I'm not aware of a single tool that will do everything, unfortunately.
Also, this isn't a "benchmarking suite"; it's a tool you can use to instrument whatever load you're running, which I'd say is better. It's often used to improve software but could also identify if faster RAM will help.
I've also spent a lot of time recently trying to understand how 'perf' works at a low level. I'm getting closer, but a lot of it is still pretty impenetrable. There are four main resources that I'd recommend.
The first, which you've almost surely found, is the "Perf Wiki": https://perf.wiki.kernel.org/index.php/Main_Page There's not a lot there, but it's a good introduction.
The second, which you might have stumbled across, is the text documentation scattered throughout the kernel source. Most are in https://github.com/torvalds/linux/tree/master/tools/perf/Doc..., but the most useful one is up one directory at https://github.com/torvalds/linux/blob/master/tools/perf/des....
The third is Andi Kleen's PMU Tools: https://github.com/andikleen/pmu-tools The 'jevents' library within this illustrates how to use 'perf' to set up the counters while using 'rdpmc' from userspace to read them.
The fourth is Vince Weaver's Unofficial Linux Perf Events Web-Page (http://web.eece.maine.edu/~vweaver/projects/perf_events/) and his associated Perf Event Testsuite (https://github.com/deater/perf_event_tests). Tests make wonderful examples.
The deeper I got into it, the more I realized that 'perf' is still evolving, and that there is a lot of anger and discontent below the surface. There were (and are) competing alternatives, but 'perf' is politically in control. Much of what you read about 'perf' should probably be viewed through the lens of "history written by the victor", and "the vanquished" may have different perspectives.
---
Separately, in case anyone is already familiar with the internals, here's an aspect where I'm currently stuck. There is an "offset" field which one is supposed to add to the result read from "rdpmc", but when I do so I get strange problems: https://github.com/nkurz/pmu-tools/commit/f2ab49207d4c7b7ddd...
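One thing that may be worth checking here (a guess, not a diagnosis of the linked commit): the read sequence documented in the comment block for struct perf_event_mmap_page in include/uapi/linux/perf_event.h sign-extends the raw 'rdpmc' value to 'pmc_width' bits before adding it to 'offset'. The sketch below shows only that arithmetic, in Python for brevity. The function names are hypothetical, and the surrounding lock/sequence retry loop and the 'rdpmc' call itself are left out.

```python
def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of `value` as a two's-complement number."""
    value &= (1 << width) - 1      # keep only the bits the counter actually has
    sign_bit = 1 << (width - 1)    # top bit of a `width`-bit counter
    return (value ^ sign_bit) - sign_bit


def counter_value(offset: int, raw_pmc: int, pmc_width: int) -> int:
    """Combine the mmap page's `offset` with a raw rdpmc result.

    Hypothetical helper: mirrors the documented
    `pmc <<= 64 - width; pmc >>= 64 - width; count += pmc;` steps.
    """
    return offset + sign_extend(raw_pmc, pmc_width)


if __name__ == "__main__":
    # A 48-bit counter that reads back as all ones is -1, not 2**48 - 1.
    print(counter_value(1000, (1 << 48) - 1, 48))
```

If the raw value is added without the sign extension, a counter whose top bit is set shows up as a huge positive jump, which is one way to get results that look strange.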
http://www.agner.org/optimize/
http://www.intel.com/content/www/us/en/processors/architectu...
IACA, perf, pmu-tools, and likwid are very useful tools.
https://software.intel.com/en-us/articles/intel-architecture...
https://perf.wiki.kernel.org/index.php/Main_Page
What processor are you running this on? If Intel, you might have luck with some of the more Intel-specific wrappers here: https://github.com/andikleen/pmu-tools
You also might have better luck with 'likwid': https://code.google.com/p/likwid/
Here are the arguments I was giving it to check:
    sudo likwid -C 1 -g \
      INSTR_RETIRED_ANY:FIXC0, \
      CPU_CLK_UNHALTED_CORE:FIXC1, \
      CPU_CLK_UNHALTED_REF:FIXC2, \
      DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0 \
      ./bytesum 1gb_file

    | INSTR_RETIRED_ANY               | 7.38826e+08 |
    | CPU_CLK_UNHALTED_CORE           | 5.42765e+08 |
    | CPU_CLK_UNHALTED_REF            | 5.42753e+08 |
    | DTLB_LOAD_MISSES_WALK_COMPLETED | 1.04509e+06 |

    sudo likwid -C 1 -g \
      INSTR_RETIRED_ANY:FIXC0, \
      CPU_CLK_UNHALTED_CORE:FIXC1, \
      CPU_CLK_UNHALTED_REF:FIXC2, \
      DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0 \
      ./hugepage_prefetch 1gb_file

    | INSTR_RETIRED_ANY               | 5.79098e+08 |
    | CPU_CLK_UNHALTED_CORE           | 2.63809e+08 |
    | CPU_CLK_UNHALTED_REF            | 2.63809e+08 |
    | DTLB_LOAD_MISSES_WALK_COMPLETED | 11970       |
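The raw counts above are easier to compare as ratios. A small sketch, assuming the usual definitions of instructions per cycle (INSTR_RETIRED_ANY / CPU_CLK_UNHALTED_CORE) and completed DTLB walks per thousand instructions; the helper name 'derived_metrics' is made up for illustration, and the numbers are copied from the two runs above.

```python
# Counter values copied from the two likwid runs above.
runs = {
    "bytesum": {
        "INSTR_RETIRED_ANY": 7.38826e+08,
        "CPU_CLK_UNHALTED_CORE": 5.42765e+08,
        "DTLB_LOAD_MISSES_WALK_COMPLETED": 1.04509e+06,
    },
    "hugepage_prefetch": {
        "INSTR_RETIRED_ANY": 5.79098e+08,
        "CPU_CLK_UNHALTED_CORE": 2.63809e+08,
        "DTLB_LOAD_MISSES_WALK_COMPLETED": 11970,
    },
}


def derived_metrics(counters):
    """Turn raw counts into ratios (hypothetical helper, not part of likwid)."""
    instructions = counters["INSTR_RETIRED_ANY"]
    cycles = counters["CPU_CLK_UNHALTED_CORE"]
    walks = counters["DTLB_LOAD_MISSES_WALK_COMPLETED"]
    return {
        "ipc": instructions / cycles,
        "walks_per_kilo_instr": 1000 * walks / instructions,
    }


if __name__ == "__main__":
    for name, counters in runs.items():
        metrics = derived_metrics(counters)
        print(f"{name}: IPC={metrics['ipc']:.2f}, "
              f"DTLB walks per 1000 instr={metrics['walks_per_kilo_instr']:.3f}")
```

Read this way, the hugepage version retires fewer instructions in about half the core cycles, and its completed page walks nearly disappear.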
The other main advantage of 'likwid' is that it allows you to profile just a section of the code, rather than the program as a whole. For odd political reasons, 'perf' doesn't make this possible.
P.S. I think your 'argc' check is off by one. Since the name of the program is in argv[0], and argc is the length of argv, you want to check 'argc != 2' to confirm that a filename has been given.