
Serial vs parallel: different behaviour

Ferrazzano__Vincenzo — Fri, 02 Aug 2019 11:13:02 GMT

We wrote a header-only library in which we use Intel MKL (wrapped by Armadillo) and OpenMP in a nested way.

In broad strokes, the header-only library does something like this:

for (int step = 1; step < N_steps; step++) {
	// serial code: some linear algebra (SVD / pseudoinverse)
	#pragma omp parallel for
		// matrix multiplication
}

My projects usually have the following include structure:

exe: uses parallel Intel MKL, selected via VS -> Properties -> Intel Performance Libraries -> Use Intel MKL

static_lib: compiles the header-only library into some functions; it includes just the Intel MKL headers

header-only library: includes the Intel MKL headers

We repeat this structure for different projects, with the header-only library in common.

For SOME of the projects, the code in the header-only library crashes at random places. Sometimes it fails in the serial part, where the SVD fails with the message:
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Sometimes it fails in the loop, where an out-of-bounds location in a vector is accessed. If I remove the OpenMP pragma, it just fails in the SVD at some point.

If I switch Intel MKL to serial in the exe options, it works just fine. The behaviour is the same whether I enable or disable OpenMP support in Visual Studio.

Any clue what is causing this? The code spends most of its time in the parallel for, where Intel MKL should be serial anyway, but we would like to use any speedup we can get.

Our setup:

C++ 17

VS 15.9.14

Intel MKL 2019.4.245

CPU: Intel Xeon Gold 6126 @ 2.59 GHz

OS: Windows 10

but we had this issue on different machines, and with previous versions of VS and Intel MKL.

Happy to provide any information you might require.


Gennady_F_Intel — Mon, 05 Aug 2019 03:42:58 GMT

Here is the link to the MKL usage model: disable Intel MKL internal threading for the whole application...

With regard to "Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL": this is not a known issue for MKL 2019. Could you check that the input data doesn't contain NaNs or Infs?

If the inputs are correct, could you give us a reproducer for the case where the problem occurs?
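The NaN/Inf check suggested above could be sketched roughly as follows. This is only a minimal illustration, not part of MKL: `all_finite` is a hypothetical helper name, and it assumes the matrix is available as a contiguous array of doubles.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not an MKL function): returns true only if
// every one of the n values is finite, i.e. neither NaN nor +/-Inf.
bool all_finite(const double* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (!std::isfinite(data[i])) {
            return false;  // found a NaN or an Inf
        }
    }
    return true;
}
```

With an Armadillo matrix `A`, one would presumably call it as `all_finite(A.memptr(), A.n_elem)` right before the SVD; Armadillo's own `A.is_finite()` member should serve the same purpose.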


Ferrazzano__Vincenzo — Wed, 21 Aug 2019 09:08:09 GMT

I have changed my code to something like this:

const int mklThreads = mkl_get_max_threads();
for (int step = 1; step < N_steps; step++) {
    // serial code: some linear algebra (SVD / pseudoinverse)
    mkl_set_num_threads(1);
    #pragma omp parallel for
        // matrix multiplication
    mkl_set_num_threads(mklThreads);
}

The problem still persists. I will try to reproduce the issue in a smaller project.


Ferrazzano__Vincenzo — Fri, 06 Sep 2019 15:16:22 GMT

Hi.
I replaced the PINV with the MKL-only implementation suggested here

https://software.intel.com/en-us/articles/implement-pseudoinverse-of-a-matrix-by-intel-mkl,

Now, linking against the parallel version makes the dgesdd routine return 2.

As I mentioned, this happens only for some projects where we link our library. For others, everything works fine.

Another phenomenon that might hint in the right direction: after more testing and profiling, we realised that the number of threads in our project is not really taken into account by Intel MKL, even in those projects where linking against the parallel version works fine. Regardless of the number of threads selected, the performance is the same, although the number of threads seems to be set correctly.

We tried setting the number of threads with any combination of:
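For reference, the return value can be read using the usual LAPACK convention for the `info` code of dgesdd: zero means success, a negative value `-i` means the i-th argument was illegal, and a positive value means the SVD iteration did not converge. A minimal sketch of that reading, where `describe_gesdd_info` is a hypothetical helper name and not an MKL function:

```cpp
#include <string>

// Hypothetical helper: turns a dgesdd info code into a readable
// message, following the standard LAPACK convention.
std::string describe_gesdd_info(int info) {
    if (info == 0) {
        return "success";
    }
    if (info < 0) {
        // -info is the position of the offending argument
        return "argument " + std::to_string(-info) + " had an illegal value";
    }
    // info > 0: the internal bidiagonal SVD did not converge
    return "SVD did not converge (info = " + std::to_string(info) + ")";
}
```

Under that convention, a return value of 2 would indicate a convergence failure rather than a bad argument, which is consistent with the suggestion to check the input matrix for NaNs or Infs.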
omp_set_num_threads()

mkl_set_num_threads()

mkl_set_local_num_threads()

and setting back the old number of threads after the operation is performed.

To be sure, I saved the matrix and tried the same function in a "fresh" project. There, the performance scales with the number of processors.


Gennady_F_Intel — Sat, 07 Sep 2019 04:26:53 GMT

Regarding the dgesdd routine returning 2: you may give us a reproducer and we will look at this case on our side.

Regarding performance: what is the typical problem size, and how many OpenMP threads do you run?