- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

[Edited title to more accurately reflect the issue at hand]

Is the build number for the latest Intel MKL version 20161005 ? We are still struggling with a bug that was reported last year, regarding usage of the sparse cluster solver when using more than one rank.

The release notes for Intel MKL 2017 Update 1 indicate that this bug was fixed, but before we re-open it I'd like to make sure our test (done with 20161005) used the latest and greatest.

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

On Linux it is 2017.1.132

On Windows it is 2017.1.143

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi OP,

Could you let us know the bug number or other link information ? so we can check the fix information?

Yes, the latest official release version is MKL 2017 update 1. ( update 2 will be release in the month, you should be able to get notification automatically from system)

You can check the build data from mkl_version.h file.

for example under windows:

#if 0

/*

! Content:

! Intel(R) Math Kernel Library (MKL) interface

!******************************************************************************/

#endif

#ifndef _MKL_VERSION_H_

#define _MKL_VERSION_H_

#define __INTEL_MKL_BUILD_DATE **20161006**

#define __INTEL_MKL__ 2017

#define __INTEL_MKL_MINOR__ 0

#define __INTEL_MKL_UPDATE__ 1

#define INTEL_MKL_VERSION 20170001

#endif

and under Linux :

#if 0

/*

! Content:

! Intel(R) Math Kernel Library (MKL) interface

!******************************************************************************/

#endif

#ifndef _MKL_VERSION_H_

#define _MKL_VERSION_H_

#define __INTEL_MKL_BUILD_DATE 20161006

#define __INTEL_MKL__ 2017

#define __INTEL_MKL_MINOR__ 0

#define __INTEL_MKL_UPDATE__ 1

#define INTEL_MKL_VERSION 20170001

You get the build data from sparse cluster solver, right? there is a little difference, but the version should be MKL 2017 update 1.

Best Regards

Ying

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Here is a reproducer of the bug. The sparse cluster solver is used to solve a small Ax=b system. It seems there is an issue with OpenMP threading/MPI interaction.

The bug will only appear once in a while, so it is critical to create a test matrix (and repeat the test multiple times) for each combination of MPI processes/OpenMP threads. This bug may be related to DPD200588182 that we reported previously and was marked as 'fixed' in the release notes here: https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2017-bug-fixes-list

Using MPI_INIT_THREAD() instead of MPI_INIT() helped 'a little' in the sense that the occurrence of the bug is 'reduced'.

The latest version of MKL is used - see the comment section in the source file below for a description of the commands setting up the environment, as well as the compile and build commands. The two required input files are attached.

! The matrix A is real, symmetric positive definite - and only its lower triangle is provided. A right-hand side vector B is ! provided so that the linear system A.X = B can be solved with CLUSTER_SPARSE_SOLVER. ! ! File inputs: ! ! - The file "file_1.txt" contains the size of the matrix A, its number of nonzero elements, the number of right-hand ! sides to be analyzed, and the index arrays IA and JA. ! - The file "file_2.txt" contains the matrix values and the right hand sides. ! ! Bug description: ! ! - The output "MAXVAL(ABS(X))" should be equal to 1. ! ! - For some combinations of (1) number of MPI processes (ranks) and (2) number of OpenMP threads per rank, the output ! "MAXVAL(ABS(X))" becomes incorrect. ! ! - The incorrect results are not always obtained; in fact, the problem shows up randomly. ! ! - In order to reproduce the bug, create a test matrix similar to the one below ! ! ! | Number of ! | OpenMP threads A "PASS" indicates that the correct result was obtained in 15 ! Number of MPI | 1 2 3 9 consecutive runs of the code. A "FAIL" indicates that at least ! processes | one run was faulty. ! |------------------------------- ! 1 | pass pass pass pass ! 2 | pass pass pass > FAIL < Note that many of these MPI ranks/OpenMP threads combinations were ! 3 | pass pass pass pass oversubscribing the computational resources (cores) at hand. ! 7 | pass pass pass pass ! 16 | pass pass pass pass ! ! ! ! Compilation/build/run commands (Linux): ! ! ! > module load icc/17.0 ! > . /opt/intel/compiler17/compilers_and_libraries_2017.1.132/linux/bin/compilervars.sh intel64 ! > . /opt/intel/compiler17/compilers_and_libraries_2017.1.132/linux/mkl/bin/mklvars.sh intel64 mod ! > . /opt/intel/compiler17/compilers_and_libraries_2017.1.132/linux/mpi/bin64/mpivars.sh ! > mpiifort -c $MKLROOT/include/mkl_spblas.f90 ! > mpiifort -c $MKLROOT/include/mkl_cluster_sparse_solver.f90 ! > mpiifort -c MAIN.f90 ! > mpiifort -o TEST *.o -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 ! -liomp5 -lpthread -lm -ldl ! > export OMP_NUM_THREADS=9 ! > mpirun -n 2 TEST PROGRAM MAIN USE MPI USE MKL_CLUSTER_SPARSE_SOLVER USE MKL_SPBLAS IMPLICIT NONE INTEGER(KIND=4),PARAMETER :: FILE_UNIT_1 = 1000 INTEGER(KIND=4),PARAMETER :: FILE_UNIT_2 = 2000 INTEGER(KIND=4) :: NUM_PROCS,RANK,IERR,MAXFCT,MNUM,MTYPE,PHASE,MSGLVL,IPARM(64),PERM(1),N,NRHS,NNZ,II INTEGER(KIND=4) :: MPI_REQUIRED,MPI_PROVIDED INTEGER(KIND=4),ALLOCATABLE :: IA(:), JA(:) REAL(KIND=8),ALLOCATABLE :: A(:),B(:,:),X(:,:) LOGICAL :: MPI_IS_INITIALIZED CHARACTER(LEN=128) :: MKL_STRING TYPE(MKL_CLUSTER_SPARSE_SOLVER_HANDLE) :: PT(64) MPI_REQUIRED = MPI_THREAD_MULTIPLE CALL MPI_INIT_THREAD(MPI_REQUIRED,MPI_PROVIDED,IERR) CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR) CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NUM_PROCS,IERR) CALL MPI_BARRIER(MPI_COMM_WORLD, IERR) CALL MKL_GET_VERSION_STRING(MKL_STRING) IF (RANK==0) THEN WRITE(*,*) 'MKL VERSION: ',TRIM(ADJUSTL(MKL_STRING)) WRITE(*,*) 'MPI_REQUIRED = ',MPI_REQUIRED,'; MPI_PROVIDED = ',MPI_PROVIDED OPEN(FILE_UNIT_1,FILE='file_1.txt',FORM='FORMATTED',STATUS='OLD') READ(FILE_UNIT_1,*) N READ(FILE_UNIT_1,*) NNZ READ(FILE_UNIT_1,*) NRHS END IF CALL MPI_BCAST(NRHS,1,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) CALL MPI_BCAST(NNZ,1,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) CALL MPI_BCAST(N,1,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) ALLOCATE(IA(N+1),JA(NNZ),A(NNZ),B(N,NRHS),X(N,NRHS)) IF (RANK==0) THEN READ(FILE_UNIT_1,*) IA READ(FILE_UNIT_1,*) JA CLOSE(FILE_UNIT_1) OPEN(UNIT=FILE_UNIT_2,FILE='file_2.txt',FORM='FORMATTED',STATUS='OLD') READ(FILE_UNIT_2,*) A READ(FILE_UNIT_2,*) B(:,1) CLOSE(FILE_UNIT_2) ENDIF MAXFCT = 1 MNUM = 1 MTYPE = 2 PERM = 0 MSGLVL = 1 DO II=1,64 PT(II)%DUMMY = 0 ENDDO PHASE = 13 CALL CLUSTER_SPARSE_SOLVER(PT,MAXFCT,MNUM,MTYPE,PHASE,N,A,IA,JA,PERM,NRHS,IPARM,MSGLVL,B,X,MPI_COMM_WORLD,IERR) IF (RANK==0) WRITE(*,*) 'MAXVAL(ABS(X)) = ',MAXVAL(ABS(X)) CALL MPI_FINALIZE(IERR) END PROGRAM MAIN

When execution is successful, the output of MAXVAL(ABS(X)) should be 1.0 as shown below:

MKL VERSION: Intel(R) Math Kernel Library Version 2017.0.1 Product Build 20161005 for Intel( R) 64 architecture applications MPI_REQUIRED = 3 ; MPI_PROVIDED = 3 === PARDISO is running in In-Core mode, because iparam(60)=0 === Percentage of computed non-zeros for LL^T factorization 14 % 15 % 55 % 93 % 100 % === PARDISO: solving a symmetric positive definite system === 1-based array indexing is turned ON PARDISO double precision computation is turned ON METIS algorithm at reorder step is turned ON Single-level factorization algorithm is turned ON Summary: ( starting phase is reordering, ending phase is solution ) ================ Times: ====== Time spent in calculations of symmetric matrix portrait (fulladj): 0.008025 s Time spent in reordering of the initial matrix (reorder) : 0.002664 s Time spent in symbolic factorization (symbfct) : 0.002441 s Time spent in data preparations for factorization (parlist) : 0.000041 s Time spent in copying matrix to internal data structure (A to LU): 0.000000 s Time spent in factorization step (numfct) : 0.153495 s Time spent in direct solver at solve step (solve) : 0.000241 s Time spent in allocation of internal data structures (malloc) : 0.008641 s Time spent in additional calculations : 0.002850 s Total time spent : 0.178398 s Statistics: =========== Parallel Direct Factorization is running on 9 OpenMP < Linear system Ax = b > number of equations: 1992 number of non-zeros in A: 58290 number of non-zeros in A (%): 1.468978 number of right-hand sides: 1 < Factors L and U > number of columns for each panel: 64 number of independent subgraphs: 0 < Preprocessing with state of the art partitioning metis> number of supernodes: 385 size of largest supernode: 231 number of non-zeros in L: 281152 number of non-zeros in U: 1 number of non-zeros in L+U: 281153 gflop for the numerical factorization: 0.058924 gflop/s for the numerical factorization: 0.383882 MAXVAL(ABS(X)) = 1.00000000000000

When the output is wrong, MAXVAL(ABS(X)) has some random value, as shown here:

MKL VERSION: Intel(R) Math Kernel Library Version 2017.0.1 Product Build 20161005 for Intel( R) 64 architecture applications MPI_REQUIRED = 3 ; MPI_PROVIDED = 3 Percentage of computed non-zeros for LL^T factorization 8 % 20 % 21 % 22 % 23 % 24 % 25 % 26 % 27 % 30 % 34 % 39 % 41 % 57 % 58 % 59 % 64 % 66 % 67 % 68 % 69 % 70 % 71 % 72 % 77 % 80 % 88 % 95 % 99 % 100 % === CPARDISO: solving a symmetric positive definite system === 1-based array indexing is turned ON CPARDISO double precision computation is turned ON Minimum degree algorithm at reorder step is turned ON Single-level factorization algorithm is turned ON Summary: ( starting phase is reordering, ending phase is solution ) ================ Times: ====== Time spent in calculations of symmetric matrix portrait (fulladj): 0.053535 s Time spent in reordering of the initial matrix (reorder) : 0.010733 s Time spent in symbolic factorization (symbfct) : 0.005278 s Time spent in data preparations for factorization (parlist) : 0.000036 s Time spent in copying matrix to internal data structure (A to LU): 0.000005 s Time spent in factorization step (numfct) : 0.117436 s Time spent in direct solver at solve step (solve) : 0.125628 s Time spent in allocation of internal data structures (malloc) : 0.001863 s Time spent in additional calculations : 0.002628 s Total time spent : 0.317142 s Statistics: =========== Parallel Direct Factorization is running on 2 MPI and 9 OpenMP per MPI process < Linear system Ax = b > number of equations: 1992 number of non-zeros in A: 58290 number of non-zeros in A (%): 1.468978 number of right-hand sides: 1 < Factors L and U > number of columns for each panel: 64 number of independent subgraphs: 0 number of supernodes: 385 size of largest supernode: 231 number of non-zeros in L: 281152 number of non-zeros in U: 1 number of non-zeros in L+U: 281153 gflop for the numerical factorization: 0.058924 gflop/s for the numerical factorization: 0.501754 MAXVAL(ABS(X)) = 128.577018444593

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

In the two outputs posted above, one of the difference is that, in the 'buggy' run, the text "Preprocessing with state of the art partitioning metis" does not appear. Not sure if this helps.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi OP,

Thanks for the information.

I check the CQ DPD200588182 and some related forum discussion were in https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/675380.

Something seems still wrong where 2017 update 1 has problem under asd machine, but we can't reproduce internally.

And here you mentioned, the incorrect results are not always obtained; in fact, the problem shows up randomly.

We will try to investigate the issues and get back to you if any findings.

Best Regards,

Ying

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Ying,

The reproducer above may or may not have the same root cause at was identified previously in those other reports. Can you replicate the results that I obtained? It looks like the issue is when more than one OpenMP thread is used.

If you can't manage to reproduce it (as I said you will need to run the test case multiple times to see it appear), could you indicate what your OMP_xxx, MKL_xxx, KMP_xxx etc. environment variables are?

Thanks!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi OP

Thank you a lot for the test case. We can reproduce the problem. Just check with you what your expected time schedule for the fix?

Best Regards,

Ying

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Ying,

Yesterday we found another case that also fails when using multiple MPI ranks and a single OpenMP thread per rank. It's a larger case than the one provided, but most certainly related to the same bug.

Initially we thought we could rely on this "multiple MPI ranks, each with only one OpenMP thread" as a workaround. Since this is not the case anymore, we can now only use the solver on a single node, with one MPI rank only (and multiple OpenMP threads). This is obviously a much, much bigger problem for us now since we can't scale anymore beyond one node.

At this point I would say that for us having this bug fixed ASAP is the highest priority. As mentioned in a private communication with Intel support, we caught this elusive bug just before a product release which has been stopped since then.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi OP,

Thank you for your reply.. Get it. We will prepare for the fix ASAP and keep you informed if any updates.

Best Regards,

Ying

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Ying - any update regarding the status of the bug fix? I just tested with 2017 Update 2 and the bug is still there.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I seem to be having a similar issue with Cluster Sparse Solver. This is for a much larger matrix (dimension of ~1 million). I have a test where three identical linear system are solved in a row. It always works with 1 MPI, but fails using 4 MPI (mvapich2_ib/2.1).

With MKL 11.2.2.164 and 4 MPI:

It consistently works (1d-16 residual for ||Ax-b||) for the first two call to the solver, but fails (without error - bad residual) for the third call.

With MKL 2018 u1 and 4 MPI:

It fails for the first call to the linear solver with following error.

Fatal error in PMPI_Bcast:

Invalid count, error stack:

PMPI_Bcast(1635)..........................: MPI_Bcast(buf=0x2aadbe1c9080, count=224415584, MPI_DOUBLE_COMPLEX, root=0, MPI_COMM_WORLD) failed

MPIR_Bcast_impl(1471).....................:

MPIR_Bcast_MV2(3041)......................:

MPIR_Bcast_intra_MV2(2338)................:

MPIR_Bcast_scatter_ring_allgather_MV2(733):

scatter_for_bcast_MV2(299)................:

MPIC_Recv(412)............................: Negative count, value is -1601980288

Thanks.

--James

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page