STREAM copy bandwidth

Paul_S_ · ‎08-09-2016

Hi all,

I am currently running the STREAM (https://www.cs.virginia.edu/stream/) benchmark on one of our systems (consisting of two E5-2680 v3, 12 cores each).

The reported stream performance is 105.6, 105.9, 111.6, and 112.2 GB/s for copy, scale, add and triad, respectively.

I find it strange that STREAM copy reaches 105.6 GB/s while MKL_SCOPY (and my own copy implementation) top out around 88 GB/s. Moreover, if I extract the tuned_STREAM_Copy() function into a separate copy.c file and compile the files individually, I again observe only ~85 GB/s.

I also ensure that the arrays are aligned to 64 bytes and that tuned_STREAM_Copy() knows the alignment (via __assume_aligned). Furthermore, I'm using the __restrict__ keyword for both arrays.

A look into the assembly code reveals that all versions are actually using movntps (i.e., the streaming stores).

The questions now are: 1) Why is the STREAM copy so fast? 2) Can I trust the STREAM copy performance? 3) Is it possible that the compiler is pulling some tricks that it is not able to do if I compile the files separately?

The compiler of choice is Intel's icpc 16.0.1 with the following compiler flags: -O3 -qopenmp -ffreestanding -xhost -D_OPENMP -DTUNED
The following environment variables showed the best performance: KMP_AFFINITY=compact,1 OMP_NUM_THREADS=24

Thanks, Paul

McCalpinJohn · ‎08-10-2016

The "-ffreestanding" flag prevents the compiler from replacing the STREAM Copy kernel with a call to a library routine. As you have seen, the "optimized" library routine is actually slower than the straightforward code generated by the compiler in this case. It is typically the case that the "optimized" library routine is faster when using a small number of threads or when using smaller array sizes, but the compiler-generated code is fastest when using all (or nearly all) of the cores on very large (>>L3 cache) arrays. The "-ffreestanding" flag has some undesirable effects in multi-architecture environments (it eliminates the startup checks on instruction set compatibility), but you can get the same effect with the "-nolib-inline" flag.

You can tell that your STREAM Copy performance is reasonable because it is almost identical to the STREAM Scale performance, and STREAM Scale has the same memory access pattern as STREAM Copy (both read one array and write one array).
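
To make the comparison concrete, here is a minimal sketch of the two loops. It is not the actual stream.c source: the function names and the counted_bytes() helper are hypothetical, and the OpenMP pragma is left out for brevity. Both kernels read one array and write one array, so the benchmark counts the same number of bytes for each.

```c
#include <stddef.h>

/* Copy: reads a[], writes c[]. */
void copy_kernel(double *restrict c, const double *restrict a, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
}

/* Scale: reads c[], writes b[]. Same access pattern as Copy,
   plus one multiply per element. */
void scale_kernel(double *restrict b, const double *restrict c,
                  double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        b[j] = scalar * c[j];
}

/* Bytes counted per call for either kernel: one read plus one write
   of n doubles. (Hypothetical helper, for illustration only.) */
size_t counted_bytes(size_t n)
{
    return 2 * sizeof(double) * n;
}
```

Since the extra multiply in Scale is cheap compared with the memory traffic, the two reported bandwidths should land close together when both loops are compiled the same way.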

Some other ideas:

It is sometimes difficult to tell if a code path is actually in use --- there may be non-temporal stores in the assembly code, but that code path might not be selected at runtime.
The "-qopt-streaming-stores always" flag is sometimes necessary to get the compiler to generate streaming stores.
The "-DTUNED" flag is not recommended for standard use -- it is there to make it easier to compile the kernels separately. Whether this will make a difference or not is a large topic that I don't have time to deal with today.
I can't think of an easy way to do it, but when fiddling with the code and running on multiple NUMA domains it is important to make sure that the changes don't break the processor-memory affinity that STREAM counts on. Testing on a single NUMA domain (and using something like "numactl --membind=0 --cpunodebind=0 ./stream") should rule out NUMA issues as a possible confounding factor.
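
One way to take the compiler's streaming-store heuristics out of the picture is to request non-temporal stores explicitly. The sketch below assumes an x86 target with SSE2 and uses the _mm_stream_pd intrinsic (which maps to movntpd, not the movntps seen in the compiler output; wider AVX variants exist as well). The names alloc_doubles_64() and copy_nt() are hypothetical, and this is an illustration rather than a tuned kernel.

```c
#include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_stream_pd */
#include <xmmintrin.h>   /* _mm_sfence */
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles on a 64-byte boundary.
   C11 aligned_alloc expects the size to be a multiple of the
   alignment, so the byte count is rounded up. */
double *alloc_doubles_64(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 63) / 64 * 64;
    return aligned_alloc(64, bytes);
}

/* Copy src to dst with explicit non-temporal stores.
   Both pointers must be at least 16-byte aligned. */
void copy_nt(double *restrict dst, const double *restrict src, size_t n)
{
    size_t j = 0;
    for (; j + 2 <= n; j += 2) {
        __m128d v = _mm_load_pd(src + j);  /* aligned 16-byte load */
        _mm_stream_pd(dst + j, v);         /* store that bypasses the cache */
    }
    for (; j < n; j++)                     /* scalar tail for odd n */
        dst[j] = src[j];
    _mm_sfence();  /* order the streaming stores before later stores */
}
```

With the stores written this way, the instruction should appear in the assembly regardless of which file the kernel is compiled in, although checking the generated code is still worthwhile.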

Paul_S_ · ‎08-11-2016

I just noticed that icpc 16 did not issue movntps instructions for the kernels once they are separated out into a different file. Thus the performance drop is solely due to the write-allocate traffic.
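
As a rough sketch of the accounting, assuming the simple model in which every ordinary store to a line that is not in cache first reads that line from memory (write-allocate): Copy then moves three arrays' worth of data (read source, read destination, write destination), while STREAM only counts two. The function names below are hypothetical, and real hardware may deviate from this model.

```c
/* Ratio of data actually moved to data counted by STREAM for a
   Copy- or Scale-style kernel, under the simple model above. */
double copy_traffic_factor(int streaming_stores)
{
    const double counted = 2.0;  /* STREAM counts 1 read + 1 write */
    /* Without streaming stores, add 1 read for the write-allocate. */
    const double moved = streaming_stores ? 2.0 : 3.0;
    return moved / counted;
}

/* Estimated memory traffic (same unit as the input) implied by a
   bandwidth figure reported by the benchmark. */
double estimated_dram_traffic(double reported, int streaming_stores)
{
    return reported * copy_traffic_factor(streaming_stores);
}
```

Under this model a kernel without streaming stores generates 1.5 times the traffic that the benchmark reports, so its reported bandwidth is lower even when the memory system is equally busy.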

Thanks for your input!