Cluster sparse solver error when using more than 1 MPI process

OP1 · ‎02-02-2017

[Edited title to more accurately reflect the issue at hand]

Is the build number for the latest Intel MKL version 20161005 ? We are still struggling with a bug that was reported last year, regarding usage of the sparse cluster solver when using more than one rank.

The release notes for Intel MKL 2017 Update 1 indicate that this bug was fixed, but before we re-open it I'd like to make sure our test (done with 20161005) used the latest and greatest.

Jing_Xu · ‎02-02-2017

On Linux it is 2017.1.132

On Windows it is 2017.1.143

Ying_H_Intel · ‎02-02-2017

Hi OP,

Could you let us know the bug number or other link information ? so we can check the fix information?

Yes, the latest official release version is MKL 2017 update 1. ( update 2 will be release in the month, you should be able to get notification automatically from system)

You can check the build data from mkl_version.h file.

for example under windows:

#if 0
/*
! Content:
! Intel(R) Math Kernel Library (MKL) interface
!******************************************************************************/
#endif

#ifndef _MKL_VERSION_H_
#define _MKL_VERSION_H_

#define __INTEL_MKL_BUILD_DATE 20161006

#define __INTEL_MKL__ 2017
#define __INTEL_MKL_MINOR__ 0
#define __INTEL_MKL_UPDATE__ 1

#define INTEL_MKL_VERSION 20170001

#endif

and under Linux :

#if 0
/*
! Content:
! Intel(R) Math Kernel Library (MKL) interface
!******************************************************************************/
#endif

#ifndef _MKL_VERSION_H_
#define _MKL_VERSION_H_

#define __INTEL_MKL_BUILD_DATE 20161006

#define __INTEL_MKL__ 2017
#define __INTEL_MKL_MINOR__ 0
#define __INTEL_MKL_UPDATE__ 1

#define INTEL_MKL_VERSION 20170001

You get the build data from sparse cluster solver, right? there is a little difference, but the version should be MKL 2017 update 1.

Best Regards

Ying

OP1 · ‎02-03-2017

Here is a reproducer of the bug. The sparse cluster solver is used to solve a small Ax=b system. It seems there is an issue with OpenMP threading/MPI interaction.

The bug will only appear once in a while, so it is critical to create a test matrix (and repeat the test multiple times) for each combination of MPI processes/OpenMP threads. This bug may be related to DPD200588182 that we reported previously and was marked as 'fixed' in the release notes here: https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2017-bug-fixes-list

Using MPI_INIT_THREAD() instead of MPI_INIT() helped 'a little' in the sense that the occurrence of the bug is 'reduced'.

The latest version of MKL is used - see the comment section in the source file below for a description of the commands setting up the environment, as well as the compile and build commands. The two required input files are attached.

! The matrix A is real, symmetric positive definite - and only its lower triangle is provided. A right-hand side vector B is
! provided so that the linear system A.X = B can be solved with CLUSTER_SPARSE_SOLVER.
!
! File inputs:
!
!     - The file "file_1.txt" contains the size of the matrix A, its number of nonzero elements, the number of right-hand
!       sides to be analyzed, and the index arrays IA and JA.
!     - The file "file_2.txt" contains the matrix values and the right hand sides.
!
! Bug description:
!
!     - The output "MAXVAL(ABS(X))" should be equal to 1.
!
!     - For some combinations of (1) number of MPI processes (ranks) and (2) number of OpenMP threads per rank, the output
!       "MAXVAL(ABS(X))" becomes incorrect.
!
!     - The incorrect results are not always obtained; in fact, the problem shows up randomly.
!
!     - In order to reproduce the bug, create a test matrix similar to the one below
!
!
!                     | Number of
!                     | OpenMP threads                        A "PASS" indicates that the correct result was obtained in 15
!     Number of MPI   | 1        2        3        9          consecutive runs of the code. A "FAIL" indicates that at least
!     processes       |                                       one run was faulty.
!                     |-------------------------------
!               1     | pass     pass     pass     pass
!               2     | pass     pass     pass     > FAIL <   Note that many of these MPI ranks/OpenMP threads combinations were
!               3     | pass     pass     pass     pass       oversubscribing the computational resources (cores) at hand.
!               7     | pass     pass     pass     pass
!              16     | pass     pass     pass     pass
!
!
!
! Compilation/build/run commands (Linux):
!
!     
! > module load icc/17.0
! > . /opt/intel/compiler17/compilers_and_libraries_2017.1.132/linux/bin/compilervars.sh intel64
! > . /opt/intel/compiler17/compilers_and_libraries_2017.1.132/linux/mkl/bin/mklvars.sh intel64 mod
! > . /opt/intel/compiler17/compilers_and_libraries_2017.1.132/linux/mpi/bin64/mpivars.sh
! > mpiifort -c $MKLROOT/include/mkl_spblas.f90
! > mpiifort -c $MKLROOT/include/mkl_cluster_sparse_solver.f90
! > mpiifort -c MAIN.f90
! > mpiifort -o TEST *.o -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64
!                        -liomp5 -lpthread -lm -ldl
! > export OMP_NUM_THREADS=9
! > mpirun -n 2 TEST


PROGRAM MAIN
USE MPI
USE MKL_CLUSTER_SPARSE_SOLVER
USE MKL_SPBLAS

IMPLICIT NONE

INTEGER(KIND=4),PARAMETER   :: FILE_UNIT_1 = 1000
INTEGER(KIND=4),PARAMETER   :: FILE_UNIT_2 = 2000
INTEGER(KIND=4)             :: NUM_PROCS,RANK,IERR,MAXFCT,MNUM,MTYPE,PHASE,MSGLVL,IPARM(64),PERM(1),N,NRHS,NNZ,II
INTEGER(KIND=4)             :: MPI_REQUIRED,MPI_PROVIDED
INTEGER(KIND=4),ALLOCATABLE :: IA(:), JA(:)
REAL(KIND=8),ALLOCATABLE    :: A(:),B(:,:),X(:,:)
LOGICAL                     :: MPI_IS_INITIALIZED
CHARACTER(LEN=128)          :: MKL_STRING
TYPE(MKL_CLUSTER_SPARSE_SOLVER_HANDLE) :: PT(64)

MPI_REQUIRED = MPI_THREAD_MULTIPLE
CALL MPI_INIT_THREAD(MPI_REQUIRED,MPI_PROVIDED,IERR)  
CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NUM_PROCS,IERR)
CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
CALL MKL_GET_VERSION_STRING(MKL_STRING)
IF (RANK==0) THEN
    WRITE(*,*) 'MKL VERSION: ',TRIM(ADJUSTL(MKL_STRING))
    WRITE(*,*) 'MPI_REQUIRED = ',MPI_REQUIRED,'; MPI_PROVIDED = ',MPI_PROVIDED
    OPEN(FILE_UNIT_1,FILE='file_1.txt',FORM='FORMATTED',STATUS='OLD')
    READ(FILE_UNIT_1,*) N
    READ(FILE_UNIT_1,*) NNZ
    READ(FILE_UNIT_1,*) NRHS
END IF

CALL MPI_BCAST(NRHS,1,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
CALL MPI_BCAST(NNZ,1,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
CALL MPI_BCAST(N,1,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
    
ALLOCATE(IA(N+1),JA(NNZ),A(NNZ),B(N,NRHS),X(N,NRHS))
    
IF (RANK==0) THEN
    READ(FILE_UNIT_1,*) IA
    READ(FILE_UNIT_1,*) JA
    CLOSE(FILE_UNIT_1)
    OPEN(UNIT=FILE_UNIT_2,FILE='file_2.txt',FORM='FORMATTED',STATUS='OLD')
    READ(FILE_UNIT_2,*) A
    READ(FILE_UNIT_2,*) B(:,1)
    CLOSE(FILE_UNIT_2)
ENDIF

MAXFCT = 1
MNUM   = 1
MTYPE  = 2
PERM   = 0
MSGLVL = 1

DO II=1,64
    PT(II)%DUMMY = 0
ENDDO

PHASE = 13
CALL CLUSTER_SPARSE_SOLVER(PT,MAXFCT,MNUM,MTYPE,PHASE,N,A,IA,JA,PERM,NRHS,IPARM,MSGLVL,B,X,MPI_COMM_WORLD,IERR)

IF (RANK==0) WRITE(*,*) 'MAXVAL(ABS(X)) = ',MAXVAL(ABS(X))
            
CALL MPI_FINALIZE(IERR)
    
END PROGRAM MAIN

When execution is successful, the output of MAXVAL(ABS(X)) should be 1.0 as shown below:

 MKL VERSION: 
 Intel(R) Math Kernel Library Version 2017.0.1 Product Build 20161005 for Intel(
 R) 64 architecture applications
 MPI_REQUIRED =            3 ; MPI_PROVIDED =            3
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===

Percentage of computed non-zeros for LL^T factorization
 14 %  15 %  55 %  93 %  100 % 

=== PARDISO: solving a symmetric positive definite system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Single-level factorization algorithm is turned ON


Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.008025 s
Time spent in reordering of the initial matrix (reorder)         : 0.002664 s
Time spent in symbolic factorization (symbfct)                   : 0.002441 s
Time spent in data preparations for factorization (parlist)      : 0.000041 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 0.153495 s
Time spent in direct solver at solve step (solve)                : 0.000241 s
Time spent in allocation of internal data structures (malloc)    : 0.008641 s
Time spent in additional calculations                            : 0.002850 s
Total time spent                                                 : 0.178398 s

Statistics:
===========
Parallel Direct Factorization is running on 9 OpenMP

< Linear system Ax = b >
             number of equations:           1992
             number of non-zeros in A:      58290
             number of non-zeros in A (%): 1.468978

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 64
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    385
             size of largest supernode:               231
             number of non-zeros in L:                281152
             number of non-zeros in U:                1
             number of non-zeros in L+U:              281153
             gflop   for the numerical factorization: 0.058924

             gflop/s for the numerical factorization: 0.383882

 MAXVAL(ABS(X)) =    1.00000000000000

When the output is wrong, MAXVAL(ABS(X)) has some random value, as shown here:

 MKL VERSION: 
 Intel(R) Math Kernel Library Version 2017.0.1 Product Build 20161005 for Intel(
 R) 64 architecture applications
 MPI_REQUIRED =            3 ; MPI_PROVIDED =            3

Percentage of computed non-zeros for LL^T factorization
 8 %  20 %  21 %  22 %  23 %  24 %  25 %  26 %  27 %  30 %  34 %  39 %  41 %  57 %  58 %  59 %  64 %  66 %  67 %  68 %  69 %  70 %  71 %  72 %  77 %  80 %  88 %  95 %  99 %  100 % 

=== CPARDISO: solving a symmetric positive definite system ===
1-based array indexing is turned ON
CPARDISO double precision computation is turned ON
Minimum degree algorithm at reorder step is turned ON
Single-level factorization algorithm is turned ON


Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.053535 s
Time spent in reordering of the initial matrix (reorder)         : 0.010733 s
Time spent in symbolic factorization (symbfct)                   : 0.005278 s
Time spent in data preparations for factorization (parlist)      : 0.000036 s
Time spent in copying matrix to internal data structure (A to LU): 0.000005 s
Time spent in factorization step (numfct)                        : 0.117436 s
Time spent in direct solver at solve step (solve)                : 0.125628 s
Time spent in allocation of internal data structures (malloc)    : 0.001863 s
Time spent in additional calculations                            : 0.002628 s
Total time spent                                                 : 0.317142 s

Statistics:
===========
Parallel Direct Factorization is running on 2 MPI and 9 OpenMP per MPI process

< Linear system Ax = b >
             number of equations:           1992
             number of non-zeros in A:      58290
             number of non-zeros in A (%): 1.468978

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 64
             number of independent subgraphs:  0
             number of supernodes:                    385
             size of largest supernode:               231
             number of non-zeros in L:                281152
             number of non-zeros in U:                1
             number of non-zeros in L+U:              281153
             gflop   for the numerical factorization: 0.058924

             gflop/s for the numerical factorization: 0.501754

 MAXVAL(ABS(X)) =    128.577018444593

OP1 · ‎02-03-2017

In the two outputs posted above, one of the difference is that, in the 'buggy' run, the text "Preprocessing with state of the art partitioning metis" does not appear. Not sure if this helps.

Ying_H_Intel · ‎02-03-2017

Hi OP,

Thanks for the information.

I check the CQ DPD200588182 and some related forum discussion were in https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/675380.

Something seems still wrong where 2017 update 1 has problem under asd machine, but we can't reproduce internally.
And here you mentioned, the incorrect results are not always obtained; in fact, the problem shows up randomly.
We will try to investigate the issues and get back to you if any findings.

Best Regards,
Ying

OP1 · ‎02-06-2017

Hi Ying,

The reproducer above may or may not have the same root cause at was identified previously in those other reports. Can you replicate the results that I obtained? It looks like the issue is when more than one OpenMP thread is used.

If you can't manage to reproduce it (as I said you will need to run the test case multiple times to see it appear), could you indicate what your OMP_xxx, MKL_xxx, KMP_xxx etc. environment variables are?

Thanks!

Ying_H_Intel · ‎02-08-2017

Hi OP

Thank you a lot for the test case. We can reproduce the problem. Just check with you what your expected time schedule for the fix?

Best Regards,

Ying

OP1 · ‎02-09-2017

Ying,

Yesterday we found another case that also fails when using multiple MPI ranks and a single OpenMP thread per rank. It's a larger case than the one provided, but most certainly related to the same bug.

Initially we thought we could rely on this "multiple MPI ranks, each with only one OpenMP thread" as a workaround. Since this is not the case anymore, we can now only use the solver on a single node, with one MPI rank only (and multiple OpenMP threads). This is obviously a much, much bigger problem for us now since we can't scale anymore beyond one node.

At this point I would say that for us having this bug fixed ASAP is the highest priority. As mentioned in a private communication with Intel support, we caught this elusive bug just before a product release which has been stopped since then.

Ying_H_Intel · ‎02-10-2017

Hi OP,

Thank you for your reply.. Get it. We will prepare for the fix ASAP and keep you informed if any updates.

Best Regards,

Ying

OP1 · ‎02-23-2017

Ying - any update regarding the status of the bug fix? I just tested with 2017 Update 2 and the bug is still there.

kestyn__james · ‎03-05-2018

I seem to be having a similar issue with Cluster Sparse Solver. This is for a much larger matrix (dimension of ~1 million). I have a test where three identical linear system are solved in a row. It always works with 1 MPI, but fails using 4 MPI (mvapich2_ib/2.1).

With MKL 11.2.2.164 and 4 MPI:

It consistently works (1d-16 residual for ||Ax-b||) for the first two call to the solver, but fails (without error - bad residual) for the third call.

With MKL 2018 u1 and 4 MPI:

It fails for the first call to the linear solver with following error.

Fatal error in PMPI_Bcast:

Invalid count, error stack:

PMPI_Bcast(1635)..........................: MPI_Bcast(buf=0x2aadbe1c9080, count=224415584, MPI_DOUBLE_COMPLEX, root=0, MPI_COMM_WORLD) failed

MPIR_Bcast_impl(1471).....................:

MPIR_Bcast_MV2(3041)......................:

MPIR_Bcast_intra_MV2(2338)................:

MPIR_Bcast_scatter_ring_allgather_MV2(733):

scatter_for_bcast_MV2(299)................:

MPIC_Recv(412)............................: Negative count, value is -1601980288

Thanks.

--James