Threading bug in MKL 10.2 update 6

Luc_Buatois · ‎10-13-2010

Hello all,

We are currently using the PARDISO solver to solve symmetric definite systems. Since we updated to MKL 10.2 update 6, we have been facing a threading bug that severely hurts performance.

We have an 8-core machine (no hyperthreading, of course) and call mkl_set_num_threads(8), but when decomposing the matrix and solving a linear system, only 5 threads are fully used! In the debugger it is possible to check that 8 threads have been created by MKL. We tried setting mkl_set_dynamic(0), but every attempt leads to the same problem: only 5 threads are used instead of 8, and the processing time is slower than with MKL 10.2 update 2.

We also took some time to test for this threading bug with MKL 10.3 beta2, but that showed the same slowdown.

Are you aware of such a bug? Is there any plan to fix it soon?

Thanks in advance for your help.

Bests.

Alexander_K_Intel2 · ‎10-13-2010

Hi,

Could you provide us with the values of iparm(60) and iparm(64) after PARDISO has run? And what kind of processor do you use?

With best regards,

Alexander Kalinkin

Gennady_F_Intel · ‎10-13-2010

And what was the size of the problem you solved?

Luc_Buatois · ‎10-13-2010

We are using a Xeon E5420. After PARDISO has run, iparm(60)=0, which is consistent with our request for an in-core run, and iparm(64)=102000114.

Note that the system fits easily in memory, since it takes less than 2 GB of the 16 GB of RAM available. By the way, we saw the bug on all the systems we tried, whatever their size.

Below are the params we used for PARDISO:

(all parameters not listed have been set to 0, and the indices below are 0-based "C" indices)

iparm_[ 0] = 1 ; /* No solver default */
iparm_[ 1] = 3 ; /* Fill-in reordering from METIS. 3==OpenMP METIS! */

We call PARDISO with the following parameters:

PARDISO( handle, 1, 0, 2, phase, N, values, row_index, columns, dummy_interger, 1, iparm_, 0, dummy_double, dummy_double, error ) ;
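Since the thread mixes the 1-based indices used in the documentation, such as iparm(60), with the 0-based C indices used in the listing above, a small sketch of the mapping may help. The helper names iparm_c_index and fill_iparm_like_post are hypothetical and are not part of MKL. The second helper only reproduces the two entries quoted in the post and assumes every other entry is 0, as stated there.

```c
#include <assert.h>
#include <stddef.h>

#define IPARM_LEN 64  /* PARDISO's iparm array has 64 entries */

/* Convert a 1-based index as written in the documentation, e.g. iparm(60),
   to the 0-based index used for a C array, e.g. iparm_[59].
   Hypothetical helper, not an MKL function. */
size_t iparm_c_index(int doc_index)
{
    assert(doc_index >= 1 && doc_index <= IPARM_LEN);
    return (size_t)(doc_index - 1);
}

/* Fill iparm_ as described in the post: all entries zero except the two
   that are listed, addressed by their 0-based C indices. */
void fill_iparm_like_post(int iparm_[IPARM_LEN])
{
    for (size_t i = 0; i < IPARM_LEN; ++i)
        iparm_[i] = 0;
    iparm_[0] = 1;  /* iparm(1): do not use solver defaults */
    iparm_[1] = 3;  /* iparm(2): fill-in reordering choice from the post */
}
```

With this mapping, the two values requested earlier in the thread, iparm(60) and iparm(64), are read from iparm_[59] and iparm_[63] in C.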

Thanks a lot for your help.

Bests.

Luc B.

Luc_Buatois · ‎10-13-2010

One of the matrices we tested was approximately 200'000 by 200'000 with 4'000'000 non-zero elements. But we saw the bug with all sizes, both smaller and bigger ones.
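For a sense of scale, the size of the input arrays alone can be estimated with a short calculation. This is only a sketch. It assumes the three-array sparse row layout suggested by the argument names in the call (values, row_index, columns), with row_index holding N+1 entries, 8-byte double values, and either 4-byte or 8-byte integers. The function name csr_input_bytes is hypothetical. The estimate does not include the factor computed by PARDISO, which is typically much larger because of fill-in.

```c
#include <stddef.h>
#include <stdint.h>

/* Estimate the bytes used by the three input arrays of a sparse matrix
   stored in row-compressed form: nnz values, nnz column indices, and
   n + 1 row pointers. Hypothetical helper for illustration only. */
uint64_t csr_input_bytes(uint64_t n, uint64_t nnz,
                         size_t int_bytes, size_t value_bytes)
{
    uint64_t values    = nnz * (uint64_t)value_bytes;
    uint64_t columns   = nnz * (uint64_t)int_bytes;
    uint64_t row_index = (n + 1) * (uint64_t)int_bytes;
    return values + columns + row_index;
}
```

For the approximate figures above, csr_input_bytes(200000, 4000000, 4, 8) gives 48800004 bytes, roughly 47 MiB, so the input itself is a small fraction of the memory footprint reported in the thread.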

Konstantin_A_Intel · ‎10-13-2010

Hello,

Could you please give us a bit more information: which PARDISO phase actually slowed down? Reordering, factorization, or solving? You may set msglvl=1 and send us the output of the 2 runs.

Thanks a lot,

Konstantin

Luc_Buatois · ‎10-14-2010

Hello, in fact all parallelized phases seem to have been impacted. The ones that really matter to us are the factorization and solve phases.

I need to clarify and correct some things I said earlier.

With MKL 10.2 update 2, factorization works fine in parallel. The solve phase was not parallelized, so it was slow but workable.

In update 6, factorization is slower when running in parallel: only 5 threads are used even though 8 are created. The solve phase is parallelized and is a little faster than in update 2, but still only 5 threads are running in this phase!

Concerning MKL 10.3 beta2, I made a mistake yesterday. In fact it behaves exactly like 10.2 update 2: factorization works fine in parallel, and the solve phase, which does not seem to be parallelized, is slow but workable. It seems that the recent work in 10.2 update 6 on speeding up the solve phase through parallelization is the key...

As requested, please find below the output of a PARDISO run for a matrix of size 217007 by 217007 with exactly 3365857 nnz. The log seems to be completely broken, but I don't know why!

Thanks a lot for your help!

Luc.

Number of items read = 100

*** Error in PARDISO memory allocation: WORK_I0 , size to allocate: 0 bytes

================ PARDISO: solving a symm. posit. def. system =================
=============== PARDISO: solving a symmetric indef. system ==================
============== PARDISO: solving a real struct. sym. system ===================
============= PARDISO: solving a symmetric indef. system ====================
============ PARDISO: solving a compl. str. sym. system ================
================ PARDISO: solving a real nonsymmetric system ================

reorder
Summary PARDISO: (
factorize Time parlist:
solve Time parlist:
clean Time parlist:
to Time parlist:
Times: Time parlist:
Time fulladj: Time symbfct:

Time A to LU:
================ PARDISO: solving a complex nonsym. system ================
Time numfct :
Time cgs :

0.000000 s cgx iterations -16843009
0.000000 s
< Parallel Direct Factorization with #processors: > 3365857
< Hybrid Solver PARDISO with CGS/CG Iteration >

< Numerical Factorization with Level-3 BLAS performance >

< Linear system Ax = b>
#non-zeros in A: 64
non-zeros in A (): 0.000000
#columns for each panel: 73823
#independent subgraphs: 3481
< Preprocessing with multiple minimum degree with constraints >
< no multiple minimum degree on the separator nodes >
< Preprocessing with input permutation >

Percentage of computed non-zeros for LL^T factorization
0 %
1 %
2 %
[..]
98 %
99 %
100 %

*** Error in PARDISO memory allocation: WORK_I0 , size to allocate: 4013312 bytes

================ PARDISO: solving a symm. posit. def. system =================
=============== PARDISO: solving a Herm. pos. def. system ==================
============== PARDISO: solving a real struct. sym. system ===================
============= PARDISO: solving a Herm. pos. def. system ====================
============ PARDISO: solving a compl. str. sym. system ================
================ PARDISO: solving a real nonsymmetric system ================

reorder
Summary PARDISO: (
) Time parlist:
================ Time parlist:
Times: Time parlist:
Time fulladj: Time symbfct:

Time A to LU:
================ PARDISO: solving a complex nonsym. system ================
Time numfct :
Time cgs :

0.000000 s cgx iterations -16843009
0.000000 s
< Parallel Direct Factorization with #processors: > 3365857
< Hybrid Solver PARDISO with CGS/CG Iteration >

< Numerical Factorization with Level-3 BLAS performance >

< Linear system Ax = b>
#non-zeros in A: 64
non-zeros in A (): 0.000000
#columns for each panel: 73823
#independent subgraphs: 3481
< Preprocessing with multiple minimum degree with constraints >
< no multiple minimum degree on the separator nodes >
< Preprocessing with input permutation >
#supernodes: -2032018434

size of largest supernode: 4626015541689943357

*** Error in PARDISO memory allocation: WORK_I0 , size to allocate: 4013312 bytes

================ PARDISO: solving a symm. posit. def. system =================
=============== PARDISO: solving a Hermitian indef. system ==================
============== PARDISO: solving a real struct. sym. system ===================
============= PARDISO: solving a Hermitian indef. system ====================
============ PARDISO: solving a compl. str. sym. system ================
================ PARDISO: solving a real nonsymmetric system ================

reorder
Summary PARDISO: (
=========== Time parlist:
Time fulladj: Time symbfct:

Time A to LU:
================ PARDISO: solving a complex nonsym. system ================
Time numfct :
Time cgs :

0.000000 s cgx iterations -16843009
0.000000 s
< Parallel Direct Factorization with #processors: > 3365857
< Hybrid Solver PARDISO with CGS/CG Iteration >

< Numerical Factorization with Level-3 BLAS performance >

< Linear system Ax = b>
#non-zeros in A: 64
non-zeros in A (): 0.000000
#columns for each panel: 73823
#independent subgraphs: 3481
< Preprocessing with multiple minimum degree with constraints >
< no multiple minimum degree on the separator nodes >
< Preprocessing with input permutation >
#supernodes: -2032018434

size of largest supernode: 4626015541689943357

Luc_Buatois · ‎10-15-2010

Any idea what's going wrong? Thanks!

barragan_villanueva_ · ‎10-15-2010

Hi,

Just looking at messages such as:
#supernodes: -2032018434
and at the memory allocation errors, it looks like you should link with the ILP64 MKL libraries (please use the correct compiler options and link line).
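To illustrate why a negative count can hint at an integer-width mismatch, here is a minimal sketch of one way such a value can arise: a count held in a 64-bit integer is cut down to its low 32 bits and then read as a signed 32-bit integer. The function name low_half_as_int32 is hypothetical, and this is not a diagnosis of the log above, only an example of the general effect.

```c
#include <stdint.h>
#include <string.h>

/* Keep only the low 32 bits of a 64-bit value and reinterpret them as a
   signed 32-bit integer. Hypothetical helper for illustration only. */
int32_t low_half_as_int32(int64_t value)
{
    uint32_t low = (uint32_t)((uint64_t)value & 0xFFFFFFFFu);
    int32_t out;
    /* memcpy reinterprets the bit pattern without relying on
       implementation-defined integer conversion. */
    memcpy(&out, &low, sizeof out);
    return out;
}
```

Small values pass through unchanged, but a count of 3000000000, which does not fit in a signed 32-bit integer, comes back as -1294967296 on the usual two's complement platforms.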

Luc_Buatois · ‎10-15-2010

Thanks for your answer.

I tried linking with ILP64 and the corresponding compiler options, but I still get exactly the same broken log, the wrong number of threads used, and the slowness, though at least the results are correct!

I have the same trouble with any problem size, even very small ones. It looks like a scheduling bug.

Any other ideas?

Thanks a lot for your help!

barragan_villanueva_ · ‎10-16-2010

Hi,

Could you please send your compiler (C or Fortran) options and the whole link line?
Did you use the compiler options -i8 for Fortran or -DMKL_ILP64 for C?
Small test-case to reproduce the problem would be very helpful.