Benchmarks sound like they're fighting hardware frequency throttling more than actually benchmarking the chips.

One big issue to point out:

>Floating Point: NAMD

GCC sucks at automatic hardware vectorization. So does LLVM. Really, the only time you can count on getting automatic hardware vectorization is if you shell out for ICC AND write your code in Fortran. I'm gonna bet they didn't vectorize a goddamn thing, but we can't inspect Anand's binary so we'll never know. The results are still likely correct, but IBM should have lost by a smaller margin.
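For anyone who does have the source, you can at least ask the compiler what it vectorized. A minimal sketch, assuming a made-up `saxpy` loop (nothing here comes from NAMD or the article's build): the `restrict` qualifiers promise the arrays don't alias, which is usually what the auto-vectorizer is missing in C.

```c
#include <stddef.h>

/*
 * Hypothetical example loop, not taken from the benchmark.
 * y[i] += a * x[i] for i in [0, n).
 * `restrict` tells the compiler x and y never overlap, so it is free
 * to load, multiply and store several elements per instruction.
 *
 * To see whether the loop was vectorized (flags as I understand them):
 *   gcc   -O3 -fopt-info-vec-optimized -c saxpy.c
 *   clang -O3 -Rpass=loop-vectorize    -c saxpy.c
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

If the report stays silent for a hot loop, that loop most likely ran scalar, which is the situation I'm guessing the benchmark was in.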

TL;DR

POWER8 is fun but costs 5k more than Xeon per rack mount and uses about 2x the power for 10% less performance on generalized workloads. But it can pull off 10-15% more performance on _some specialized_ workloads. So meh?

My experience has been that icc (the C compiler) does a very good job of auto-vectorization. xlc does as well, provided you avoid the horribly misdesigned POWER6 architecture.

I agree that gcc sucks for this; we didn't have LLVM last time I was doing this kind of work, so I can't comment on that.

If you want to really use SIMD units (on x86) check out ISPC. I recommend writing small functions in it that work over large chunks of memory so that SIMD can run at full speed with good cache locality.

This ends up being -very- fast.
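To make the "small functions over large chunks of memory" shape concrete, here is a rough sketch in plain C rather than ISPC. The kernel names, the `CHUNK` size and the math are all made up for illustration; in ISPC each inner loop would be a `foreach` instead of a `for`, but the structure is the same: tiny branch-light kernels over contiguous floats, driven one cache-sized chunk at a time.

```c
#include <stddef.h>

/* Assumed chunk size: 4096 floats (16 KiB), picked to sit in L1. Tune it. */
enum { CHUNK = 4096 };

/* Hypothetical kernel 1: out[i] = in[i] * scale + bias over a contiguous run. */
static void scale_bias(const float *restrict in, float *restrict out,
                       size_t n, float scale, float bias)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * scale + bias;
}

/* Hypothetical kernel 2: clamp every element into [0, 1] in place. */
static void clamp01(float *data, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float v = data[i];
        data[i] = v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
    }
}

/* Driver: run both kernels on one chunk before moving to the next,
 * so the second kernel finds its input still in cache. */
void process(const float *restrict in, float *restrict out,
             size_t n, float scale, float bias)
{
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        scale_bias(in + off, out + off, len, scale, bias);
        clamp01(out + off, len);
    }
}
```

Calling `process(in, out, n, 2.0f, 0.0f)` on a big buffer doubles and clamps every element; the point is only the layout and loop structure, not these particular kernels.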