The last serial part of my application is a call to DSYEVR. My attempts to parallelize it have resulted in very strange behavior; I hope someone can help me understand it.
Depending on the data, I run DSYEVR alone or two/three instances of it inside an OMP PARALLEL SECTIONS construct. My application is compiled with icc on a Cray with MKL 10.3 update 3 (the parallel version). The matrices are small, 61x61.
As suggested elsewhere, I call omp_set_nested(1), mkl_set_dynamic(0) and mkl_set_num_threads(n) (n = 1-8) at the beginning of the code, then run my application on a varying number of threads (1-16).
With the above setup, performance drops dramatically above 2 threads, whatever number of threads I reserve for MKL.
To check my code I linked with -mkl=sequential, and the scaling is what I expected. So I presume the culprit is MKL and its interaction with omp_set_nested.
I also implemented the "fake nesting" suggested on this forum (I cannot find the reference anymore, but it was about starting more threads than requested by OMP_NUM_THREADS). There is a small speed advantage when running on 4 nodes, but overall the scaling does not change. I interpret this as the DSYEVR calls not being parallelized at all.
Any ideas? This call is clearly limiting my code's scalability, as is also visible in profilers such as Vampir.
Intel MKL is being clever here: it knows that small matrices do not benefit from threading, so it does not parallelize them. With a big matrix, DSYEVR uses dsytrd, which parallelizes only partially, and dlarfb, which parallelizes well. You would need to organize the program code differently. A refined version of dlarfb is included in the latest versions of Intel MKL: http://redfort-software.intel.com/en-us/forums/showthread.php?t=77331