serial vs parallel: different behaviour

Hi

We wrote a header-only library that uses Intel MKL (wrapped by Armadillo) and OpenMP in a nested way.

In broad strokes, the header-only library does something like this:

for (int step = 1; step < N_steps; ++step) {
    // serial code: some linear algebra (SVD / pseudoinverse)
    #pragma omp parallel for
    // matrix multiplication
}

My projects usually have the following include structure:

exe          — uses Intel MKL parallel, set in VS -> Properties -> Intel Performance Libraries -> Use Intel MKL

 ^

static_lib   — compiles the header-only library inside some function; includes only the Intel MKL headers

 ^

header-only  — includes the Intel MKL headers

We repeat this structure across different projects, which all share the header-only library.

For SOME of the projects, the code in the header-only library crashes in a seemingly random way. Sometimes it fails in the serial part (the SVD fails with the message:
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.),

sometimes in the loop, where some out-of-bounds location in a vector is accessed. If we remove the OpenMP pragma, it still fails in the SVD at some point.

If I switch Intel MKL to sequential in the exe's options, it works just fine. The behaviour is the same whether I include or exclude OpenMP support in Visual Studio.

Any clue as to what is causing this? The code spends most of its time in the parallel for, where Intel MKL should be running serially anyway, but we would like to use any speedup we can get.

Our setup:

C++:        17
VS:         15.9.14
Intel MKL:  2019.4.245
CPU:        Intel Xeon Gold 6126 @ 2.59 GHz
OS:         Windows 10

We have had this issue on different machines, and with previous versions of VS and Intel MKL as well.

Happy to provide any information you might require.

4 Replies

Moderator

Here is the link to the MKL usage model: disable Intel MKL internal threading for the whole application...

With regard to "Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL": this is an unknown issue for MKL 2019. Could you check that the input data does not contain NaNs or Infs?

If the inputs are correct, could you give us a reproducer for the problem?


I have changed my code to something like this:

const int mklThreads = mkl_get_max_threads();
for (int step = 1; step < N_steps; ++step) {
    // serial code: some linear algebra (SVD / pseudoinverse)
    mkl_set_num_threads(1);
    #pragma omp parallel for
    // matrix multiplication
    mkl_set_num_threads(mklThreads);
}

The problem still persists. I will try to reproduce the issue in a smaller project.


Hi.
I replaced the PINV with the MKL-only implementation suggested here:

https://software.intel.com/en-us/articles/implement-pseudoinverse-of-a-matrix-by-intel-mkl

Now, linking against the parallel version makes the dgesdd routine return info = 2.

As I mentioned, this happens only for some of the projects where we link our library. For others, everything works fine.

Another phenomenon that might hint in the right direction: after more testing/profiling, we realised that the number of threads set in our project is not really taken into account by Intel MKL, even in those projects where linking against the parallel version works fine. Regardless of the number of threads selected, performance is the same, although the number of threads seems to be correctly set.

We tried setting the number of threads with every combination of:

omp_set_num_threads()

mkl_set_num_threads()

mkl_set_num_threads_local()

and setting back the old number of threads after the operation is performed.

To be sure, I saved the matrix and tried the same function in a "fresh" project. There, performance scales with the number of processors.

Moderator

Regarding dgesdd returning 2: you may give us a reproducer and we will look at this case on our side.

Regarding performance: what is the typical problem size, and how many OpenMP threads do you run?
