Problems with mkl_cluster_sparse_solver

Horst · ‎05-24-2019

Dear all,

unfortunately, I have again some troubles with the mkl_cluster_sparse_solver as in my previous topic. I have taken one of the examples intel provides in the example dir of mkl and modified it in two ways: on the one hand the code can now read an arbitrary matrix stored in the file fort.110 and on the other hand I perform a loop over the routines since I want to change the matrix within one cycle later on. The first problem arises when treating large system sizes.

In this case, you can find the matrix in fort1.zip. The program aborts with a segmentation fault after 18%: forrtl: severe (174): SIGSEGV, segmentation fault occurred. Unfortunately, this is somehow hard to track down what is the issue but it must be in the subroutine since it starts. As I said this happens for large matrices. Unfortunately I dont know how to get rid of this problem.

The next problem occurs for small matrices as found in fort.zip. The problem seems to be the loop: the first cycle everything works fine but the second cycle aborts with an error message I have already seen in one of my last topics:

Fatal error in PMPI_Reduce: Message truncated, error stack:
PMPI_Reduce(2334).................: MPI_Reduce(sbuf=0x7d7d7f8, rbuf=0x7f0b900, count=22912, MPI_DOUBLE, MPI_SUM, root=0, comm=0x84000004) failed
MPIR_Reduce_impl(1439)............: fail failed
I_MPIR_Reduce_intra(1533).........: Failure during collective
MPIR_Reduce_intra(1201)...........: fail failed
MPIR_Reduce_Shum_ring(833)........: fail failed
MPIDI_CH3U_Receive_data_found(131): Message from rank 1 and tag 11 truncated; 14000 bytes received but buffer size is 1296

I have tried what I did the last time: provide all parameters (nhrs, msglevel, iparm, ..) for all ranks again but it does not seem to fix the issue.

This is the program code (cl_solver_f90.f90):

program cluster_sparse_solver
use mkl_cluster_sparse_solver
implicit none
include 'mpif.h'
integer, parameter :: dp = kind(1.0D0)
!.. Internal solver memory pointer for 64-bit architectures
TYPE(MKL_CLUSTER_SPARSE_SOLVER_HANDLE)  :: pt(64)

integer maxfct, mnum, mtype, phase, nrhs, error, msglvl, i, ik, l1, k1, idum(1), DimensionL, Nsparse
integer*4 mpi_stat, rank, num_procs
double precision :: ddum(1)

integer, allocatable :: IA( : ),  JA( : ), iparm( : )
double precision, allocatable :: VAL( : ), rhodot( : ), rho( : )

integer(4) MKL_COMM


MKL_COMM=MPI_COMM_WORLD
call mpi_init(mpi_stat)
call mpi_comm_rank(MKL_COMM, rank, mpi_stat)


do l1 = 1, 64
  pt(l1)%dummy = 0
end do

 error       = 0   ! initialize error flag
 msglvl      = 1   ! print statistical information
 mtype       = 11  ! real, non-symmetric
 nrhs        = 1
 maxfct      = 1
 mnum        = 1

allocate(iparm(64))
 
do l1 = 1, 64
 iparm(l1) = 0
end do

!Setup PARDISO control parameter
 iparm(1)  = 1   ! do not use default values
 iparm(2)  = 3   ! fill-in reordering from METIS
 iparm(8)  = 100 ! Max. number of iterative refinement steps on entry
 iparm(10) = 13  ! perturb the pivot elements with 1E-13
 iparm(11) = 1   ! use nonsymmetric permutation and scaling MPS
 iparm(13) = 1   ! Improved accuracy using nonsymmetric weighted matching
 iparm(27) = 1   ! checks whether column indices are sorted in increasing order within each row

read(110,*) DimensionL, Nsparse

allocate(VAL(Nsparse),JA(Nsparse),IA(DimensionL))

if (rank.eq.0) then
do k1=1,Nsparse
read(110,*) VAL(k1)
end do
do k1=1,DimensionL+1
read(110,*) IA(k1)
end do
do k1=1,Nsparse
read(110,*) JA(k1)
end do
end if

allocate(rhodot(DimensionL), rho(DimensionL))

if (rank.eq.0) then
rhodot=0.0d0
rhodot(1) = 1.0d0
rho=0.0d0
end if

if (rank.eq.0) write(*,*) 'INIT PARDISO'

ik = 0
Pardisoloop: do

ik = ik + 1

phase = 12
call cluster_sparse_solver_64 ( pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, nrhs, iparm, msglvl, ddum, ddum, MKL_COMM, error )
if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error

phase = 33
call cluster_sparse_solver_64 ( pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, nrhs, iparm, msglvl, rhodot, rho, MKL_COMM, error )
if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error

if (ik.ge.4) exit Pardisoloop

end do Pardisoloop


call MPI_BARRIER(MKL_COMM,mpi_stat)

phase = -1
call cluster_sparse_solver_64 ( pt, maxfct, mnum, mtype, phase, DimensionL, ddum, idum, idum, idum, nrhs, iparm, msglvl, ddum, ddum, MKL_COMM, error )
if (error.ne.0.and.rank.eq.0) write(*,*) 'Release of memory: ', error


call mpi_finalize(mpi_stat)

end

I compile with

mpiifort -i8 -I${MKLROOT}/include -c -o mkl_cluster_sparse_solver.o ${MKLROOT}/include/mkl_cluster_sparse_solver.f90
mpiifort -i8 -I${MKLROOT}/include -c -o cl_solver_f90.o cl_solver_f90.f90
mpiifort mkl_cluster_sparse_solver.o cl_solver_f90.o -o MPI.out -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

and run the program with mpiexec -n 2 ./MPI.out. Our cluster has 16 cores per node and I request two of them. Ram should not be the problem (64gb), since it perfectly runs with the normal pardiso on just one node. I set export MKL_NUM_THREADS=16. Am I right that the slave MPI process should automatically obtain parts of the LL^T factors or do I have to use the distributed version in order to do so? The reason why I ask is that I cannot observe any process running on the second node.

The Versions are: MKL version: 2017.4.256, Ifort version: 17.0.6.256, IMPI version: 2017.4.239, but my college can also reproduce the issue on other versions/clusters.

Thanks in advance,

Horst

Horst · ‎05-27-2019

Just an update since I somehow proceed with the problem

I think the problem of the segmentation fault is somehow connected to openmp, since I sometimes get this error message:

libiomp5.so 00002B71618FFD43 __kmp_invoke_micr Unknown Unknown
libiomp5.so 00002B71618CE7AD __kmp_fork_call Unknown Unknown
libiomp5.so 00002B71618A682E __kmpc_fork_call Unknown Unknown

This error is obtained when running valgrind --tool=memcheck --track-origins=yes --leak-check=yes -v --leak-check=full --show-reachable=yes. However, the rest of the output does (at least to me) not give any further information about what the error could be.

The problem of the loop can sometimes be solved by calling

call MPI_Bcast(pt,64,MPI_INTEGER, 0, MKL_COMM,mpi_stat)

sometimes it still gives an error. Moreover, I am not sure if I can update the pt like this, since it is in principle not an integer.

Kirill_V_Intel · ‎05-27-2019

Hello Horst,

The documentation for the cluster sparse solver (https://software.intel.com/en-us/mkl-developer-reference-fortran-cluster-sparse-solver) says, that the entries of pt must be set to zero before the first call. Thus, in-between the loop iterations you need to set pt elements to 0. I quickly checked that this seems to fix the issue. Let me know if you still see them and under which conditions then.

As I understand your intention is to change the matrix between the loop iterations. How exactly do you want to re-use the previous calls to the cluster sparse solver? If you just want to call the entire functionality again, starting with reordering, then you should consider the next loop as calling cluster sparse solver "for the first time" and again set pt entries to 0. If you want to do something else, e.g., re=using the existing factorization (as approximate inverse for the modified matrix), you can use only the solve step (with iterative refinement turned on) and then you should not do anything about the pt.

Notice that you trick with broadcasting pt doesn't do the right thing.

Hope this helps!

Best,
Kirill

Horst · ‎05-28-2019

Thank you very much for your response.

I also checked it and yes it fixes the issue. Yes, my idea is to use the iterative refinement or when needed the reordering and factorization. I interpreted the first call as the very first call before the loop starts. However, not setting iparm=0 for pardiso_64 works fine. Thats the reason why I didn't think about doing this again. Thank you for this information.

Concerning the segmentation fault, I have seen a strange behavior in my further investigations: the program calculates to a higher percentage of LL^T factors when I set export OMP_NUM_THREADS=1 instead of export OMP_NUM_THREADS=16. For sure, the calculations last longer but the segmentation fault occurs later. Then I think openmp is not the issue (as I thought) since in the case of one thread it does not work either.

Best,
Horst

Kirill_V_Intel · ‎05-28-2019

Hello Horst,

To avoid misunderstanding, am I right that the problem with the smaller matrix has been solved now? So, the segfault happens for the larger matrix, right?

As for the percentage computed before the segfault, it's normal that the numbers are different for different thread number.

Thanks,
Kirill

Horst · ‎05-28-2019

Hello Kirill,

thank you for your response. Yes, the problem with the smaller matrix has been solved. And yes, the segfault still happens for the larger matrix. Do you have any idea how I can track down the problem of this segfault? Thanks in advance.

Best,
Horst

Kirill_V_Intel · ‎05-28-2019

Hello Horst,

We'll investigate the issue. As a workaround I suggest temporarily switching to the different code path by disabling matching, scaling and distributed reordering. That is, set iparm(2) = 2, iparm(11) = 0 and iparm(13) = 0.

Best,
Kirill

Horst · ‎05-29-2019

Hello Kirill,

thank you for your suggestions. Unfortunately, I cannot report that this workaround works. The segfault still occurs. I also tried some combination of it but it changed nothing.

Thanks for investigating this problem.

Kirill_V_Intel · ‎05-29-2019

Hello Horst,

Can you share the code, run settings and the output for your last statement? I'm kinda surprised that the segfault still exists, I don't observe it in my runs.

Thanks,
Kirill

Horst · ‎05-29-2019

Hey Kirill,

for sure, here is the code:

program cluster_sparse_solver
use mkl_cluster_sparse_solver
implicit none
include 'mpif.h'
integer, parameter :: dp = kind(1.0D0)
!.. Internal solver memory pointer for 64-bit architectures
TYPE(MKL_CLUSTER_SPARSE_SOLVER_HANDLE)  :: pt(64)

integer maxfct, mnum, mtype, phase, nrhs, error, msglvl, i, ik, l1, k1, idum(1), DimensionL, Nsparse
integer*4 mpi_stat, rank, num_procs
double precision :: ddum(1)

integer, allocatable :: IA( : ),  JA( : ), iparm( : )
double precision, allocatable :: VAL( : ), rhodot( : ), rho( : )

integer(4) MKL_COMM


MKL_COMM=MPI_COMM_WORLD
call mpi_init(mpi_stat)
call mpi_comm_rank(MKL_COMM, rank, mpi_stat)


do l1 = 1, 64
  pt(l1)%dummy = 0
end do

 error       = 0   ! initialize error flag
 msglvl      = 1   ! print statistical information
 mtype       = 11  ! real, non-symmetric
 nrhs        = 1
 maxfct      = 1
 mnum        = 1

allocate(iparm(64))
 
do l1 = 1, 64
 iparm(l1) = 0
end do

!Setup PARDISO control parameter
 iparm(1)  = 1   ! do not use default values
 iparm(2)  = 2   ! fill-in reordering from METIS
 iparm(8)  = 100 ! Max. number of iterative refinement steps on entry
 iparm(10) = 13  ! perturb the pivot elements with 1E-13
 iparm(27) = 1   ! checks whether column indices are sorted in increasing order within each row

read(110,*) DimensionL, Nsparse

allocate(VAL(Nsparse),JA(Nsparse),IA(DimensionL))

if (rank.eq.0) then
do k1=1,Nsparse
read(110,*) VAL(k1)
end do
do k1=1,DimensionL+1
read(110,*) IA(k1)
end do
do k1=1,Nsparse
read(110,*) JA(k1)
end do
end if

allocate(rhodot(DimensionL), rho(DimensionL))

if (rank.eq.0) then
rhodot=0.0d0
rhodot(1) = 1.0d0
rho=0.0d0
end if

if (rank.eq.0) write(*,*) 'INIT PARDISO'

phase = 12
call cluster_sparse_solver_64 ( pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, nrhs, iparm, msglvl, ddum, ddum, MKL_COMM, error )
if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error

phase = 33
call cluster_sparse_solver_64 ( pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, nrhs, iparm, msglvl, rhodot, rho, MKL_COMM, error )
if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error


call MPI_BARRIER(MKL_COMM,mpi_stat)

phase = -1
call cluster_sparse_solver_64 ( pt, maxfct, mnum, mtype, phase, DimensionL, ddum, idum, idum, idum, nrhs, iparm, msglvl, ddum, ddum, MKL_COMM, error )
if (error.ne.0.and.rank.eq.0) write(*,*) 'Release of memory: ', error


call mpi_finalize(mpi_stat)

end

I compile the program with

mpiifort -i8 -I${MKLROOT}/include -c -o mkl_cluster_sparse_solver.o ${MKLROOT}/include/mkl_cluster_sparse_solver.f90
mpiifort -i8 -I${MKLROOT}/include -c -o cl_solver_f90.o cl_solver_f90.f90
mpiifort mkl_cluster_sparse_solver.o cl_solver_f90.o -o MPI.out -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

and run the program with mpiexec -n 2 ./MPI.out. The output is the following:

INIT PARDISO

Percentage of computed non-zeros for LL^T factorization
1 % 2 % 3 % 4 % 5 % 6 % 7 % 8 % 9 % 10 % 11 % 12 %

and the segfault is:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred

If you need any further information, please let me know.

Best

Kirill_V_Intel · ‎05-29-2019

This one, with advanced options turned off, I cannot reproduce with the latest versions of Intel MKL, though I did observe the failure for MKL 2017 Update 4. Since that release we've had a lot of updates for the cluster sparse solver.

Best,
Kirill

Horst · ‎05-30-2019

Hello,

thank you for reply. Unfortunately, I have no access to the latest version. I have tried it up to the 2018 version. Then we need to update the compiler.

Thank you very much!

Gennady_F_Intel · ‎12-24-2019

Hi Horst,

The fix of this issue available in MKL v.2020 ( the Initial version) and MKL 2019 u5. You may check these versions and let us know how it works on your side.