1/4 second to plow through 1 GB of memory is certainly fast compared to some things (like a human reader), but it seems oddly slow relative to what a modern computer should be capable of. Sure, it's a lot faster than a human, but it's still only 4 GB/s! A number of comments here have mentioned adding prefetch statements, but for linear access like this they usually won't help much. The real issue (if I may be so bold) is all the TLB misses. Let's measure.

Here's the starting point on my test system, an Intel Sandy Bridge E5-1620 with 1600 MHz quad-channel RAM:

  $ perf stat bytesum 1gb_file
  Size: 1073741824
  The answer is: 4

  Performance counter stats for 'bytesum 1gb_file':

        262,315 page-faults               #    1.127 M/sec
    835,999,671 cycles                    #    3.593 GHz
    475,721,488 stalled-cycles-frontend   #   56.90% frontend cycles idle
    328,373,783 stalled-cycles-backend    #   39.28% backend  cycles idle
  1,035,850,414 instructions              #    1.24  insns per cycle

    0.232998484 seconds time elapsed
Hmm, those 260,000 page-faults don't look good. And we've got 40% idle cycles on the backend. Let's try switching to 1 GB hugepages to see how much of a difference it makes:

  $ perf stat hugepage 1gb_file
  Size: 1073741824
  The answer is: 4

  Performance counter stats for 'hugepage 1gb_file':

            132 page-faults               #    0.001 M/sec
    387,061,957 cycles                    #    3.593 GHz
    185,238,423 stalled-cycles-frontend   #   47.86% frontend cycles idle
     87,548,536 stalled-cycles-backend    #   22.62% backend  cycles idle
    805,869,978 instructions              #    2.08  insns per cycle

    0.108025218 seconds time elapsed
It's entirely possible that I've done something stupid, but the checksum comes out right, and the 10 GB/s read speed is getting closer to what I'd expect for this machine. Using 1 GB pages for the contents of a file is a bit tricky, since they need to be allocated from the hugetlbfs filesystem, which doesn't allow writes and requires that the pages be reserved at boot time. My solution was to run one program that creates a shared map, copy the file in, pause that program, and then have the bytesum program read the copy that lives on the 1 GB pages.
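
In rough outline, the staging program looked something like the sketch below. To be clear, this is a reconstruction rather than my exact code: it assumes hugetlbfs is mounted at /mnt/huge with pagesize=1G, that the pages were reserved at boot (e.g. hugepagesz=1G hugepages=1 on the kernel command line), and it omits all error checking.

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(int argc, char **argv) {
      if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

      /* map the original file read-only */
      int src = open(argv[1], O_RDONLY);
      struct stat st;
      fstat(src, &st);
      void *in = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, src, 0);

      /* hugetlbfs files can't be written with write(), but they can be
         mmap'd and memcpy'd into; "/mnt/huge/1gb_copy" is an assumed path */
      int dst = open("/mnt/huge/1gb_copy", O_CREAT | O_RDWR, 0644);
      void *out = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, dst, 0);
      memcpy(out, in, st.st_size);

      pause();  /* hold the mapping while bytesum reads the copy */
      return 0;
  }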

Now that we've got the page faults out of the way, the prefetch suggestion becomes more useful:

  $ perf stat hugepage_prefetch 1gb_file
  Size: 1073741824
  The answer is: 4

  Performance counter stats for 'hugepage_prefetch 1gb_file':

            132 page-faults               #    0.002 M/sec
    265,037,039 cycles                    #    3.592 GHz
    116,666,382 stalled-cycles-frontend   #   44.02% frontend cycles idle
     34,206,914 stalled-cycles-backend    #   12.91% backend  cycles idle
    579,326,557 instructions              #    2.19  insns per cycle

    0.074032221 seconds time elapsed
That gets us up to 14.5 GB/s, which is more reasonable for a single-stream read on a single core. Based on prior knowledge of this machine, I'm issuing one prefetch 512 B ahead for each 128 B double-cacheline. Why one per 128 B? Because the hardware "buddy prefetcher" grabs two lines at a time. Why do prefetches help at all? Because the hardware "stream prefetcher" doesn't know it's dealing with 1 GB pages, and otherwise won't prefetch across 4 KB boundaries.

What would it take to speed it up further? I'm not sure. Suggestions (and independent confirmations or refutations) welcome. The most I've been able to reach in other circumstances is about 18 GB/s by doing multiple streams with interleaved reads, which allows the processor to take better advantage of open RAM banks. The next limiting factor (I think) is the number of line fill buffers (10 per core) combined with the cache latency in accordance with Little's Law.
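
To put rough numbers on the Little's Law argument (assuming ~70 ns of load-to-use memory latency, a typical figure for this class of machine rather than something I've measured here): sustained bandwidth is outstanding data divided by latency, and each line fill buffer holds one outstanding 64 B miss, so

  bandwidth ≈ (10 buffers × 64 B) / 70 ns ≈ 9 GB/s per core

which is the right ballpark. Getting above that presumably depends on the L2 hardware prefetcher keeping additional requests in flight beyond the 10 demand misses.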

Can you share the prefetching code you wrote? I tried to use __builtin_prefetch, but couldn't figure out how to make it faster.

My approach was somewhat ugly. I went with this:

  #include <emmintrin.h>   /* SSE2 intrinsics */

  __m128i vk0  = _mm_setzero_si128();  /* zeros: unpack bytes to 16-bit */
  __m128i vk1  = _mm_set1_epi16(1);    /* ones: madd sums adjacent words */
  __m128i vsum = _mm_setzero_si128();

  for (int i = 0; i < n; i += 8*16) {      /* one iteration per 128B block */
      __builtin_prefetch(&a[i + 512]);     /* one prefetch, 512B ahead */
      for (int j = 0; j < 8; j++) {
          __m128i v  = _mm_load_si128((__m128i *)&a[i + j*16]);
          __m128i vl = _mm_unpacklo_epi8(v, vk0);  /* low 8 bytes -> words */
          __m128i vh = _mm_unpackhi_epi8(v, vk0);  /* high 8 bytes -> words */
          vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
          vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
      }
  }
The goal is to issue one prefetch for each 128 B block of data you read. There are probably better ways to do this than what I did. I'm hoping the compiler did something reasonable, but I haven't really looked at the generated assembly.
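
Note that the loop above leaves the running total spread across the four 32-bit lanes of vsum; something along these lines (not shown in my excerpt, but it's the standard finish) reduces it to a scalar afterwards:

  /* horizontal sum of the four 32-bit lanes of vsum */
  vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
  vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
  int sum = _mm_cvtsi128_si32(vsum);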

Also, if it is indeed the case that TLB misses are the major factor (and I think it is), I don't think you'll have much success by adding prefetch alone. Trying it right now, I get a slight slowdown with just the prefetch. It may be that you only get a positive effect in combination with hugepages.

I profiled my program looking for TLB misses, and got 0 dTLB misses (https://gist.github.com/jvns/a42ff6f48c659cfc4600).

I think 'perf' is probably lying to you. Although maybe it's not so much a lie, as your Gist does contain that 'dTLB-misses' line. Perf tries very hard to be non-CPU-specific, and thus doesn't do a great job of handling the CPU-specific stuff.

What processor are you running this on? If Intel, you might have luck with some of the more Intel specific wrappers here: https://github.com/andikleen/pmu-tools
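
For example, pmu-tools' ocperf.py wrapper lets you ask for the relevant Sandy Bridge event by name, something like this (untested, from memory):

  ./ocperf.py stat -e dtlb_load_misses.walk_completed ./bytesum 1gb_file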

You also might have better luck with 'likwid': https://code.google.com/p/likwid/

Here are the arguments I was giving it to check:

  $ sudo likwid-perfctr -C 1 \
      -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0 \
      ./bytesum 1gb_file

  |        INSTR_RETIRED_ANY        | 7.38826e+08 |
  |      CPU_CLK_UNHALTED_CORE      | 5.42765e+08 |
  |      CPU_CLK_UNHALTED_REF       | 5.42753e+08 |
  | DTLB_LOAD_MISSES_WALK_COMPLETED | 1.04509e+06 |

  $ sudo likwid-perfctr -C 1 \
      -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0 \
      ./hugepage_prefetch 1gb_file

  |        INSTR_RETIRED_ANY        | 5.79098e+08 |
  |      CPU_CLK_UNHALTED_CORE      | 2.63809e+08 |
  |      CPU_CLK_UNHALTED_REF       | 2.63809e+08 |
  | DTLB_LOAD_MISSES_WALK_COMPLETED |    11970    |
The other main advantage of 'likwid' is that it allows you to profile just a section of the code, rather than the program as a whole. For odd political reasons, 'perf' doesn't make this possible.
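
That's done with likwid's marker API, roughly like this (a from-memory sketch: compile with -DLIKWID_PERFMON, and run under 'likwid-perfctr -m' so the marked region is measured separately):

  #include <likwid.h>

  int main(void) {
      LIKWID_MARKER_INIT;
      LIKWID_MARKER_START("bytesum");
      /* ... just the inner summing loop ... */
      LIKWID_MARKER_STOP("bytesum");
      LIKWID_MARKER_CLOSE;
      return 0;
  }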

ps. I think your 'argc' check is off by one. Since the name of the program is in argv[0], and argc is the length of argv, you want to check 'argc != 2' to confirm that a filename has been given.
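
That is, something like this (a guess at the relevant fragment, since I'm only inferring what the surrounding code looks like):

  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      if (argc != 2) {  /* argv[0] is the program name, argv[1] the filename */
          fprintf(stderr, "usage: %s <filename>\n", argv[0]);
          exit(1);
      }
      /* ... */
  }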