Intel® oneAPI Math Kernel Library

MKL: Error running cluster_sparse_solver with the -check_mpi flag and tracer in Linux, PS XE 2020.

Marcos_V_1
New Contributor I

Dear Gennady and Kirill,

We've come across an error while trying to use the tracer tool to debug the MPI section of our code with the -check_mpi linking flag. The error happens within the first call to cluster_sparse_solver (symbolic factorization): we get a collective SIZE mismatch error in a call to MPI_Gatherv issued from MKLMPI_Gatherv. We've also seen this in our main source code (FDS) on Linux, likewise with Intel MPI and Parallel Studio XE 2020 Update 1.
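
For context, the failing call is the standard analysis phase of cluster_sparse_solver. Below is a minimal, self-contained sketch of that call shape; the matrix, mtype, and iparm values are illustrative placeholders only (the attached demo distributes a much larger Poisson system over 8 ranks, so this toy is not expected to reproduce the report):

program css_phase11_sketch
  use mpi
  implicit none
  integer, parameter :: n = 8, nrhs = 1
  integer*8 :: pt(64)                       ! opaque cluster_sparse_solver handle
  integer   :: iparm(64), perm(n), ia(n+1), ja(n)
  real*8    :: a(n), b(n), x(n)
  integer   :: maxfct, mnum, mtype, phase, msglvl, error, ierr, rank, i

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Trivial SPD placeholder matrix (identity) in CSR format, held on rank 0.
  do i = 1, n
    ia(i) = i; ja(i) = i; a(i) = 1.0d0
  end do
  ia(n+1) = n + 1

  pt = 0; iparm = 0; perm = 0; b = 0.0d0; x = 0.0d0
  iparm(1)  = 1        ! do not rely on all solver defaults
  iparm(2)  = 2        ! nested-dissection reordering
  iparm(40) = 0        ! matrix provided centrally on rank 0 (placeholder choice)
  maxfct = 1; mnum = 1
  mtype  = 2           ! real symmetric positive definite (placeholder)
  msglvl = 0
  phase  = 11          ! analysis / symbolic factorization

  call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
                             perm, nrhs, iparm, msglvl, b, x, MPI_COMM_WORLD, error)
  if (rank == 0) print *, 'phase 11 returned error =', error

  phase = -1           ! release internal solver memory
  call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
                             perm, nrhs, iparm, msglvl, b, x, MPI_COMM_WORLD, error)

  call MPI_Finalize(ierr)
end program css_phase11_sketch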

To verify the finding, I used our demonstration code, which solves an 8-MPI-process Poisson problem with cluster_sparse_solver. Use the attached tarball and follow the instructions in the README:

1. Type: $ source /opt/intel20/parallel_studio_xe_2020/psxevars.sh

2. Make a test/ directory at the same level as the extracted source/ directory.

3. In source/, execute make_test.sh to compile.

4. In test/, run the css_test program with 8 MPI processes.

Any help on why this is coming up would be greatly appreciated.

Thank you for your time and attention.

Marcos

 

PS: Here is the std error:

[~test]$ mpirun -n 8 ./css_test

MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20


Starting Program ...

MPI Process 0 started on blaze.el.nist.gov
MPI Process 1 started on blaze.el.nist.gov
MPI Process 2 started on blaze.el.nist.gov
MPI Process 3 started on blaze.el.nist.gov
MPI Process 4 started on blaze.el.nist.gov
MPI Process 5 started on blaze.el.nist.gov
MPI Process 6 started on blaze.el.nist.gov
MPI Process 7 started on blaze.el.nist.gov
Into factorization Phase..

[0] ERROR: GLOBAL:COLLECTIVE:SIZE_MISMATCH: error
[0] ERROR: Mismatch found in local rank [0] (global rank [0]),
[0] ERROR: other processes may also be affected.
[0] ERROR: Root expects 442368 items but 110592 sent by local rank [0] (same as global rank):
[0] ERROR: MPI_Gatherv(*sendbuf=0x2b6882aac240, sendcount=110592, sendtype=MPI_INT, *recvbuf=0x2b6882f64080, *recvcounts=0xa4f5c80, *displs=0xa4f5d00, recvtype=MPI_INT, root=0, comm=0xffffffffc4000000 SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: No problem found in the 7 processes with local ranks [1:7] (same as global ranks):
[0] ERROR: MPI_Gatherv(*sendbuf=..., sendcount=110592, sendtype=MPI_INT, *recvbuf=..., *recvcounts=..., *displs=..., recvtype=MPI_INT, root=0, comm=... SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] INFO: 1 error, limit CHECK-MAX-ERRORS reached => aborting
[0] WARNING: starting premature shutdown

[0] INFO: GLOBAL:COLLECTIVE:SIZE_MISMATCH: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.

....

.....

Marcos_V_1
New Contributor I

Good morning Gennady, I understand it is the MPI library that comes with update 4. Please confirm:

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2019 Update 9 Build 20200923 (id: abd58e492)
Copyright 2003-2020, Intel Corporation.

$ which mpirun
/opt/intel20/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin/mpirun

Thank you,

Marcos

 

Gennady_F_Intel
Moderator

Hi Marcos.

Yes, I checked the problem with a two-year-old version of MPI. We will check with the current (latest) one and get back to this thread ASAP.

Gennady_F_Intel
Moderator

Hi Marcos,

Yes, I confirm that I see exactly the same issue you reported last time, after 1500 steps. We have to investigate the cause of the issue and will keep this thread updated.

-Gennady

Marcos_V_1
New Contributor I

Hi Gennady, any updates on this last issue? 

Thank you,

Marcos

Kirill_V_Intel
Employee

Hi Marcos,

The issue is unrelated to MKL and, most likely, to MPI as well. It has been reproduced with a simple code which does a lot of communicator split and free operations (MPI_Comm_split and MPI_Comm_free). The failure happens because of ITAC (I suspect some internal bookkeeping for the MPI communicators has a size limit which gets exceeded).
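
For illustration only, here is a minimal sketch of that kind of reproducer (a simplified sketch, not the actual internal test; the iteration count is arbitrary):

program comm_split_free_loop
  use mpi
  implicit none
  integer, parameter :: niter = 100000   ! arbitrary; large enough to stress the checker
  integer :: ierr, rank, newcomm, i

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Repeatedly split MPI_COMM_WORLD and free the resulting communicator,
  ! mimicking the communicator churn that repeated solver calls generate.
  do i = 1, niter
    call MPI_Comm_split(MPI_COMM_WORLD, mod(rank, 2), rank, newcomm, ierr)
    call MPI_Comm_free(newcomm, ierr)
  end do

  if (rank == 0) print *, 'completed', niter, 'split/free cycles'
  call MPI_Finalize(ierr)
end program comm_split_free_loop

Linked with -check_mpi and run under mpirun, a loop of this kind should exercise the same communicator bookkeeping path inside ITAC.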

The ITAC team has been informed about the issue; Gennady or I will post an update once there is a response.

Unfortunately, I cannot suggest a reliable workaround other than temporarily not using ITAC, or doing fewer solve calls / using fewer MPI processes (which, I suspect, would only defer the failure for some time).

Best,
Kirill

Marcos_V_1
New Contributor I

Hi Kirill, thank you for the update. We are eager to add the MPI checking option to our development compilation targets and make it part of our software development workflow.

What is ITAC?

Thank you,

Marcos

Gennady_F_Intel
Moderator

Marcos,

ITAC stands for Intel Trace Analyzer and Collector; that team is actually part of the MPI team.


PrasanthD_intel
Moderator

Hi Marcos,


Thanks for your patience. The issue you raised has been fixed in MKL 2020.4. Please download the latest oneAPI version (2021.2) for the latest oneMKL and let us know your experience with it.


Marcos_V_1
New Contributor I

Good morning Prasanth, the original issue with MKL has been resolved. The second issue on the same test case, related to using the -check_mpi flag when linking the test problem, still seems to be there (Gennady commented that it is related to MPI, not MKL). I can still reproduce it with oneAPI Update 2.

To see the issue, compile the code with -check_mpi added to the link flags in make_test.sh and run the case as explained. I still get the following:

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20


Starting Program ...

MPI Process 0 started on burn034
MPI Process 1 started on burn034
MPI Process 2 started on burn034
MPI Process 3 started on burn034
MPI Process 4 started on burn034
MPI Process 5 started on burn034
MPI Process 6 started on burn034
MPI Process 7 started on burn034
Into factorization Phase..
Into solve Phase..
NSOLVES = 100
NSOLVES = 200
NSOLVES = 300
NSOLVES = 400
NSOLVES = 500
NSOLVES = 600
NSOLVES = 700
NSOLVES = 800
NSOLVES = 900
NSOLVES = 1000
NSOLVES = 1100
NSOLVES = 1200
NSOLVES = 1300
NSOLVES = 1400
NSOLVES = 1500
[6] ERROR: Unexpected MPI error, aborting:
[6] ERROR: Invalid communicator, error stack:
[6] ERROR: PMPI_Comm_free(137): MPI_Comm_free(comm=0xa702140) failed
[6] ERROR: PMPI_Comm_free(85).: Null communicator
[7] ERROR: Unexpected MPI error, aborting:
[7] ERROR: Invalid communicator, error stack:
[7] ERROR: PMPI_Comm_free(137): MPI_Comm_free(comm=0xa9c30d0) failed
[7] ERROR: PMPI_Comm_free(85).: Null communicator
Abort(1) on node 6 (rank 6 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
Abort(1) on node 7 (rank 7 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7

 

Thank you,

Marcos

PrasanthD_intel
Moderator

Hi Marcos,


Sorry for the delay in response.

Yes, the MKL issue has been resolved in MKL 2020.4, whereas a fix for the ITAC issue is not available as of now; it will be available in the next release, ITAC 2021.3, which I believe will ship as part of oneAPI 2021.3.


Let us know if we can close this thread for now.


Regards

Prasanth


Kevin_McGrattan

Yes, close the thread. I'll open a new one if there is any issue with the new release. Thanks.

PrasanthD_intel
Moderator

Hi,


Thanks for the confirmation.

As your issue has been resolved, we are closing this thread and will no longer respond to it. If you find any issues in the latest version, please start a new thread. Any further interaction in this thread will be considered community only.


Regards

Prasanth

