Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Updated benchmarks, OpenMP 4 and Cilk(tm) Plus

TimP
Honored Contributor III

I have updated some benchmarks extracted (with agreement of the original authors) from http://www.netlib.org/benchmark/vectors

These are the cases which benefit from combining threaded parallelism with simd vectorization.  The "vector lengths" are changed from the original 10, 100, 1000 to 100, 1000, 2000, as 2000x2000 is the minimum problem size needed to approach full performance on recent Intel multi-core simd platforms.

Since the introduction of AVX2, Intel compilers have greatly improved their ability to handle simd vectorization of mixed strides +1 and -1 (s176).  I don't show the array reduction version of s176 (closer to the original), which performs best at 1 thread but doesn't scale to as many threads as the dot product reduction.  In part, this is due to the tradeoff between vector length and the number of threads whose working sets fit in cache within the fixed problem size.  The C++ version reverses an array in order to use the STL syntax, which admits only stride +1.  Thus you can check whether the newer platforms require this reversal to reach full performance, as past Intel CPUs did.

Recent Intel compilers have progressed to the point where C source code is superior to SSE2 intrinsics in the C++ version of s126().  Both Intel and gnu compilers take the liberty of replacing the separate multiply and add intrinsics specified in the source with FMA, which is not necessarily an optimization in this context.  The Intel default option -opt-streaming-stores replaces the final store with a nontemporal one, which has been shown to be a slight de-optimization.

The benchmarks are nearly evenly divided between those which work with the ideal arrangement of a parallel outer loop and a simd vector inner loop (25 years ago it was called Covariant Outer Vector Inner), and those which have a serial dependency at one of the two loop levels and so can support parallelism only in the outer loop, using omp do|for simd or cilk_for with _Simd.  While the opt-report claims a great speedup for cilk_for _Simd, and it is enough to be almost competitive with OpenMP on some platforms, it doesn't give a speedup over a serial for loop with a vectorized inner loop.  The exaggerated speedup is due in part to the poor instruction-level optimization of cilk_for without _Simd.  Similarly, the unroll4 option is needed to give a reasonable basis for the calculated vector speedup in C++ and Fortran, although the default unrolling is sufficient for Intel CPUs other than Haswell (no newer ones tested).  Even under OpenMP, the vectorized outer loop gives a positive speedup only when running more than 6 cores, and has never reached 3x even on MIC (meaning speedup compared to simd or parallel alone).
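For readers who want the two loop arrangements spelled out, here is a minimal C/OpenMP sketch; the array names and the recurrence are illustrative, not taken from the benchmark sources:

```c
#define N 2000
static float a[N][N], b[N][N];

/* Ideal case: threads across the outer loop, simd across the inner loop. */
void outer_parallel_inner_simd(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        #pragma omp simd
        for (int j = 0; j < N; ++j)
            a[i][j] += b[i][j];
    }
}

/* When the inner loop carries a recurrence, only the outer loop can be
 * both threaded and vectorized, e.g. with the combined "for simd" construct. */
void outer_for_simd(void)
{
    #pragma omp parallel for simd
    for (int j = 0; j < N; ++j)
        for (int i = 1; i < N; ++i)
            a[i][j] = a[i-1][j] + b[i][j];   /* serial dependency along i */
}
```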

The C++ benchmark uses schedule(runtime), so it is important to set OMP_SCHEDULE to a choice such as auto, guided, or dynamic,2.
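For illustration, a minimal sketch of what that looks like (the function and array names are mine, not from the benchmarks):

```c
/* With schedule(runtime), the scheduling policy is read from OMP_SCHEDULE
 * at run time, e.g. OMP_SCHEDULE=auto, OMP_SCHEDULE=guided, or
 * OMP_SCHEDULE=dynamic,2. */
void vadd(int n, float *restrict c, const float *restrict a, const float *restrict b)
{
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```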

On HyperThreaded host CPUs I would suggest OMP_PLACES=cores with OMP_NUM_THREADS set accordingly, as might be done in the code with the OpenMP 4.5 function omp_get_num_places().  An alternate method is the Intel OpenMP setting KMP_PLACE_THREADS, with values such as 59c,2t (for KNC).  Likewise, CILK_NWORKERS may be tried at lower than default values.
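A hedged sketch of the in-code variant, assuming OMP_PLACES=cores has been set so that each place is one core:

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* With OMP_PLACES=cores, each OpenMP "place" is one core, so using
     * one thread per place avoids putting two threads on one core. */
    int places = omp_get_num_places();   /* 0 if no place list is defined */
    if (places > 0)
        omp_set_num_threads(places);
    printf("places = %d, max threads = %d\n", places, omp_get_max_threads());
    return 0;
}
```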

While there is a potential for KMP_BLOCKTIME to delay the start of Cilk workers in the transition from the Fortran benchmark harness to cilk_for, this has not been observed in practice at default settings.  If anyone has a politically correct explanation for the performance deficit of cilk_for, they are welcome to post it.

Linux Makefiles are posted for host and MIC KNC.  Typical ifort Windows compile options include -assume:underscore -names:lowercase -fpp -O3 -debug:inline-debug-info -align:array32byte -Qunroll:4

Intel Parallel Advisor (Windows) doesn't pick up the shortest running kernels, nor does it appear to work with cilk_for kernels.

The only timing methods which are superior (with ifort on Windows) to Fortran system_clock are the QueryPerformance and rdtsc counters.  Those show occasional anomalous results due to differencing timers between CPUs and out-of-order execution.  With gfortran on Windows, system_clock (which is based on QueryPerformance) is good, as is ifort system_clock on linux.

8 Replies
TimP
Honored Contributor III

upload source files

jimdempseyatthecove
Honored Contributor III

Can we assume you returned your results on 9-track magnetic tape in ASCII format :?

Good (and complete) work. I wonder what the actual performance difference is when the functions benchmarked are used in an actual program (as opposed to benchmark).

Jim Dempsey

McCalpinJohn
Honored Contributor III

I have seen very little trouble with compiler vectorization of real-valued codes, but the icc compilers often generate really poor vector code for complex arrays stored in interleaved format.  For the very simple example of computing the squared-magnitude of a complex (interleaved) vector, the compilers I tested generate "vectorized" code that is slower than the corresponding scalar code for data at any level of the memory hierarchy.  A very simple intrinsic-based code using VPERMPS was ~3x faster for L1-contained data and ~2x faster for L2-contained data.  
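The kind of kernel being described is, in outline, the following (names are illustrative, not the actual test code):

```c
/* Squared magnitude of a complex vector stored in interleaved format:
 * z = re0, im0, re1, im1, ...  The arithmetic is trivially simd-able;
 * the interleaved loads are what the compilers handle poorly. */
void cmag2(int n, float *restrict mag, const float *restrict z)
{
    for (int i = 0; i < n; ++i)
        mag[i] = z[2*i] * z[2*i] + z[2*i+1] * z[2*i+1];
}
```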

TimP
Honored Contributor III

jimdempseyatthecove wrote:

Can we assume you returned your results on 9-track magnetic tape in ASCII format :?

Good (and complete) work. I wonder what the actual performance difference is when the functions benchmarked are used in an actual program (as opposed to benchmark).

Jim Dempsey

Thanks.

The intention is to demonstrate how to combine simd and threaded parallelism in an effective way (as well as showing how the same case looks in the 3 programming language systems), not to predict the performance of any real application.  In order to compare the performance of the various vector lengths, the shorter-length cases are repeated so that they can be timed over a similar total number of operations as the longest ones.  The cache usage pattern is therefore not in any way representative of a real application (although it does demonstrate cases of failed prefetch on MIC).

I just verified that QueryPerformance timer calls may be used to get fairly clean timings with ifort on Windows, but the source code (largely borrowed from posts on the ifort Windows forum) is ugly.  It would probably be cleaner to make the QP calls directly in C.
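Something along these lines, to be called from the Fortran harness via iso_c_binding (a hedged sketch; the function name is mine):

```c
#include <windows.h>

/* Elapsed seconds from the Win32 QueryPerformance counter, measured from
 * an arbitrary origin; difference two calls to time an interval. */
double qp_seconds(void)
{
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&count);
    return (double)count.QuadPart / (double)freq.QuadPart;
}
```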

TimP
Honored Contributor III

John McCalpin wrote:

I have seen very little trouble with compiler vectorization of real-valued codes, but the icc compilers often generate really poor vector code for complex arrays stored in interleaved format.  For the very simple example of computing the squared-magnitude of a complex (interleaved) vector, the compilers I tested generate "vectorized" code that is slower than the corresponding scalar code for data at any level of the memory hierarchy.  A very simple intrinsic-based code using VPERMPS was ~3x faster for L1-contained data and ~2x faster for L2-contained data.  

It's been discussed before; you need the complex-limited-range option to get a vectorization performance gain for double complex using abs(), divide, sqrt(), and the like.  As you said, the overhead of inserting scalar full-range code for those operations into a vectorized loop is greater than when vectorization is disabled.  The vec-report isn't much help, as the only useful rating is the ideal clocks-per-loop quotation once you invoke limited-range.  Single/float complex without complex-limited-range is handled by promoting the critical operations to double, so there is still some value in vectorization, but it may not run any faster than full double with limited range.  I prefer to set complex-limited-range explicitly rather than -fp-model fast=2, but then the imf-domain implications also need to be considered for MIC.
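The kind of loop in question looks like this (names illustrative); without a limited-range option (e.g. gcc -fcx-limited-range, or the Intel complex-limited-range option above) the compiler has to guard cabs() against intermediate overflow and underflow:

```c
#include <complex.h>

void cabs_all(int n, double *restrict r, const double complex *restrict z)
{
    /* Full-range cabs() needs scaling to avoid overflow when squaring;
     * limited-range allows the plain sqrt(re*re + im*im) form, which
     * vectorizes cleanly. */
    for (int i = 0; i < n; ++i)
        r[i] = cabs(z[i]);
}
```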

AVX and its successors seem to solve one of the problems of double complex: simd optimization of interleaved double complex in the older instruction sets, including SSE3, is not termed vectorization, nor reported in the opt-report, since it performs only one iteration at a time.  Still, the SSE3 limited-range code may come out faster than the AVX code, since AVX dropped the instruction-level support for interleaved complex multiplication which was introduced in SSE3.

I suppose it would be an unwelcome suggestion to ask whether the compiler could automatically prune the AVX version when you set the architecture to AVX plus SSE3 and it finds a loop where AVX can't match SSE performance.

Is the case you mentioned one which the compiler could consider optimizing for AVX2?  Maybe you could post it here or submit an IPS feature request.

TimP
Honored Contributor III

Updating the Fortran files to use QueryPerformance with ifort on Windows in place of system_clock.  This gives adequate resolution and reconciles the difference in Windows OS hooks between ifort and gfortran.  Plain iso_c_binding linkage to the QueryPerformance API has less overhead than the examples I've found which go through kernel32.mod.

TimP
Honored Contributor III

Some of these cases show a speedup from simd vectorization alone or from parallelization alone, but not from the combination, on a single CPU.

In s125 and s2102, the parallel speedup is due mostly to nontemporal streaming stores and to engaging multiple memory controllers.  With gnu compilers, nontemporal stores are obtained only by switching to simd intrinsics.  With Intel compilers, those functions may be compiled separately with -Qopt-streaming-stores:always or by decorating the inner loops with the vector always pragma/directive.
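A hedged sketch of the Intel-specific decoration; I'm showing the nontemporal form of the vector pragma here, which is the documented spelling for requesting streaming stores on a single loop (the benchmark sources may use a different decoration):

```c
void copy_stream(int n, double *restrict y, const double *restrict x)
{
#ifdef __INTEL_COMPILER
    /* Request nontemporal (streaming) stores for this loop; with gnu
     * compilers the equivalent effect needs _mm_stream_pd()-style
     * intrinsics, as noted above. */
    #pragma vector nontemporal
#endif
    for (int i = 0; i < n; ++i)
        y[i] = x[i];
}
```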

Cases s231 and s333-5 will run faster on 1 or 2 cores if the omp directives are de-activated and the loop nesting is switched for optimum single-thread simd vectorization, as ifort can do automatically.  In terms of efficiency per core, parallelization is disappointing for such cases.  While current gnu compilers parse the outer-loop vectorization simd clause, they ignore it.

 

TimP
Honored Contributor III

Another interesting point for your compiler trivia list is the effect on Haswell of turning off fma (-Qfma-, -mno-fma) with the various compilers, or of setting an AVX target (particularly with Intel compilers).  How did Intel manage to set up their compiler and hardware so that optimization for Sandy Bridge works so well on Haswell?  Can you verify that setting -QxHost -arch:AVX is better than -QxHost alone?  Does Qopt-report4 offer evidence that AVX2 code which is not faster than AVX has been pruned, even though it has a higher rated potential speedup?

Hint: reductions, both the scalar sum reduction (s176) and the array "prefix sum" (s126, s235), may be sensitive to the longer latency of fma.  Intel (but not gnu) compilers try to compensate in the scalar reduction by using a very large riffle factor, and to some extent in the array prefix sum by unrolling (not observing the documented unrolling controls).  Using simd intrinsics may prevent unrolling with Intel but not gnu compilers.  Both Intel and gnu compilers choose whether to use fma according to compile flags but not according to the choice of intrinsics, and Intel compilers may override the programmer's choice on the use of streaming-store intrinsics, whereas with gnu compilers intrinsics are the only way to control that.
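For anyone unfamiliar with the term, "riffle" here means splitting the reduction across several independent accumulators so the loop is not serialized on one fma chain.  A minimal sketch (the factor of 4 is arbitrary; the compilers pick their own, often much larger, factor):

```c
double dot_riffle(int n, const double *a, const double *b)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    /* Four independent partial sums hide the latency of each fma chain. */
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i]   * b[i];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    for (; i < n; ++i)          /* remainder */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```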

Also, there may be cases where the AVX code has to split mis-aligned memory access for adequate performance, and this may still be beneficial on Haswell (or at least not pose a problem for memory bandwidth limited code).

While AVX2 code performance may be more dependent on unrolling, Intel Fortran seems not to consistently use as effective an unrolling plan when AVX2 is set.

Loop body alignment is not always set by the various compilers, and setting it may improve the effectiveness of unrolling.
