I am currently running the STREAM (https://www.cs.virginia.edu/stream/) benchmark on one of our systems (consisting of two E5-2680 v3, 12 cores each).
The reported STREAM performance is 105.6, 105.9, 111.6, and 112.2 GB/s for copy, scale, add, and triad, respectively.
I find it strange that STREAM copy reaches 105.6 GB/s while MKL_SCOPY (and my own copy implementation) top out around 88 GB/s. Moreover, if I extract the tuned_STREAM_Copy() function into a separate copy.c file and compile the two files individually, I again observe only ~85 GB/s.
I also ensure that the arrays are aligned to 64 bytes and that tuned_STREAM_Copy() knows the alignment (via __assume_aligned). Furthermore, I'm using the __restrict__ keyword for both arrays.
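For readers following along, here is a minimal sketch of what such a separately compiled copy kernel might look like. It is not the STREAM source: the name copy_kernel, its argument list, and the float element type are assumptions made for illustration (float is a guess based on the movntps and SCOPY mentions). The alignment hint is guarded so the file still builds with other compilers.

```c
#include <stddef.h>

/* Hypothetical stand-alone copy kernel, e.g. the contents of a copy.c.
   Assumes float elements and that both arrays are 64-byte aligned. */
void copy_kernel(float *__restrict__ dst, const float *__restrict__ src, size_t n)
{
#if defined(__INTEL_COMPILER)
    /* Intel-specific hint: tell the compiler about the 64-byte alignment. */
    __assume_aligned(dst, 64);
    __assume_aligned(src, 64);
#endif
    /* The pragma is ignored when OpenMP is not enabled at compile time. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        dst[i] = src[i];
}
```

Whether the compiler emits streaming stores for this loop depends on the compiler version and flags, so it is worth checking the generated assembly for each file separately rather than assuming it matches the in-file STREAM kernel.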
A look into the assembly code reveals that all versions are actually using movntps (i.e., the streaming stores).
The questions now are: 1) Why is the STREAM copy so fast? 2) Can I trust the STREAM copy performance? 3) Is it possible that the compiler is pulling some tricks that it is not able to do if I compile the files separately?
The compiler of choice is Intel's icpc 16.0.1 with the following compiler flags: -O3 -qopenmp -ffreestanding -xhost -D_OPENMP -DTUNED
The following environment variables showed the best performance: KMP_AFFINITY=compact,1 OMP_NUM_THREADS=24
The "-ffreestanding" flag prevents the compiler from replacing the STREAM Copy kernel with a call to a library routine. As you have seen, the "optimized" library routine is actually slower than the straightforward code generated by the compiler in this case. The "optimized" library routine is typically faster when using a small number of threads or smaller array sizes, but the compiler-generated code is fastest when using all (or nearly all) of the cores on very large (>>L3 cache) arrays. The "-ffreestanding" flag has some undesirable effects in multi-architecture environments (it eliminates the startup checks on instruction set compatibility), but you can get the same effect with the "-nolib-inline" flag.
You can tell that your STREAM Copy performance is reasonable because it is almost identical to the STREAM Scale performance, and STREAM Scale has the same memory access pattern as STREAM Copy (both read one array and write one array).
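The comparison above rests on how STREAM credits bytes to each kernel: Copy and Scale each touch two arrays per iteration (one read, one written), while Add and Triad touch three. A small sketch of that accounting follows; the enum and function names are made up for illustration and do not appear in stream.c.

```c
#include <stddef.h>

/* Hypothetical names for the four STREAM kernels. */
typedef enum { STREAM_COPY, STREAM_SCALE, STREAM_ADD, STREAM_TRIAD } stream_kernel;

/* Number of arrays read plus arrays written per loop iteration. */
static int arrays_touched(stream_kernel k)
{
    switch (k) {
    case STREAM_COPY:  return 2;  /* c[i] = a[i]            */
    case STREAM_SCALE: return 2;  /* b[i] = s * c[i]        */
    case STREAM_ADD:   return 3;  /* c[i] = a[i] + b[i]     */
    case STREAM_TRIAD: return 3;  /* a[i] = b[i] + s * c[i] */
    }
    return 0;
}

/* Bytes STREAM credits to one pass over arrays of n elements. */
double stream_counted_bytes(stream_kernel k, size_t n, size_t elem_size)
{
    return (double)arrays_touched(k) * (double)n * (double)elem_size;
}
```

Since Copy and Scale are credited the same number of bytes and move data in the same pattern, their reported bandwidths should land close together, which is what the numbers in the question show (105.6 vs. 105.9 GB/s).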
Some other ideas:
I just noticed that icpc 16 did not issue movntps instructions for the kernels once they are separated out into a different file. Thus the performance drop is solely due to the write-allocate traffic.
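To make the write-allocate effect concrete: with ordinary stores, each destination cache line is first read into the cache before being overwritten, so a copy moves three arrays' worth of data while STREAM only credits two. The helper below is a rough sketch of that simplified model; the function name and parameters are hypothetical, and real hardware may not follow the model exactly.

```c
/* Simplified model: DRAM traffic implied by a STREAM-reported bandwidth.
   arrays_read / arrays_written describe the kernel (1 and 1 for Copy).
   write_allocate is nonzero when ordinary (non-streaming) stores are used,
   in which case every written line is assumed to be read once as well. */
double modeled_traffic_gbs(double reported_gbs, int arrays_read,
                           int arrays_written, int write_allocate)
{
    int counted = arrays_read + arrays_written;           /* what STREAM credits */
    int moved = counted + (write_allocate ? arrays_written : 0);
    return reported_gbs * (double)moved / (double)counted;
}
```

For Copy the factor is 3/2 with write-allocate and 1 with streaming stores, so under this model a kernel built without movntps reports a lower number even when the memory system is just as busy. Treat the result as an estimate from a simplified model, not as a measurement.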
Thanks for your input!