MKL: Error running cluster_sparse_solver with -check_mpi flag and tracer in Linux, PS XE 2020.

Marcos_V_1 · ‎09-07-2020

Dear Gennady and Kirill,

We've come across an error while using the tracer tool to debug the MPI section of our code with the -check_mpi linking flag. The error occurs in the first call to cluster_sparse_solver (symbolic factorization): a collective SIZE mismatch in a call to MPI_Gatherv from MKLMPI_Gatherv. We have also seen this in our main source code (FDS) on Linux, likewise using IMPI and Parallel Studio XE 2020 u1.

To verify the finding, I used our demonstration code, which solves a Poisson problem on 8 MPI processes with cluster_sparse_solver. Use the attached tarball and follow the instructions in the README:

1. Type: $ source /opt/intel20/parallel_studio_xe_2020/psxevars.sh

2. Make a test/ directory at the same level as the extracted source/ directory.

3. In source/, run make_test.sh to compile.

4. In test/, run the css_test program with 8 MPI processes.

Any help on why this is coming up would be greatly appreciated.

Thank you for your time and attention.

Marcos

PS: Here is the std error:

[~test]$ mpirun -n 8 ./css_test

MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20

Starting Program ...

MPI Process 0 started on blaze.el.nist.gov
MPI Process 1 started on blaze.el.nist.gov
MPI Process 2 started on blaze.el.nist.gov
MPI Process 3 started on blaze.el.nist.gov
MPI Process 4 started on blaze.el.nist.gov
MPI Process 5 started on blaze.el.nist.gov
MPI Process 6 started on blaze.el.nist.gov
MPI Process 7 started on blaze.el.nist.gov
Into factorization Phase..

[0] ERROR: GLOBAL:COLLECTIVE:SIZE_MISMATCH: error
[0] ERROR: Mismatch found in local rank [0] (global rank [0]),
[0] ERROR: other processes may also be affected.
[0] ERROR: Root expects 442368 items but 110592 sent by local rank [0] (same as global rank):
[0] ERROR: MPI_Gatherv(*sendbuf=0x2b6882aac240, sendcount=110592, sendtype=MPI_INT, *recvbuf=0x2b6882f64080, *recvcounts=0xa4f5c80, *displs=0xa4f5d00, recvtype=MPI_INT, root=0, comm=0xffffffffc4000000 SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: No problem found in the 7 processes with local ranks [1:7] (same as global ranks):
[0] ERROR: MPI_Gatherv(*sendbuf=..., sendcount=110592, sendtype=MPI_INT, *recvbuf=..., *recvcounts=..., *displs=..., recvtype=MPI_INT, root=0, comm=... SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] INFO: 1 error, limit CHECK-MAX-ERRORS reached => aborting
[0] WARNING: starting premature shutdown

[0] INFO: GLOBAL:COLLECTIVE:SIZE_MISMATCH: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.

....

.....
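For readers of the report above: a SIZE_MISMATCH on MPI_Gatherv means that the count a rank sends differs from the entry the root holds for that rank in recvcounts. Below is a minimal pure-Python sketch of that comparison. It makes no MPI calls, and the function name find_size_mismatches is made up for illustration. The counts for rank 0 come from the log; the root's entries for ranks 1 to 7 are assumed to match, since the checker reports no problem for them.

```python
# Pure-Python sketch (no MPI): mimics the consistency check behind
# GLOBAL:COLLECTIVE:SIZE_MISMATCH for a gatherv-style collective.

def find_size_mismatches(sendcounts, recvcounts):
    """Return (rank, expected, sent) for every rank whose sendcount
    differs from the root's recvcounts entry for that rank."""
    if len(sendcounts) != len(recvcounts):
        raise ValueError("need one sendcount and one recvcount per rank")
    return [
        (rank, expected, sent)
        for rank, (sent, expected) in enumerate(zip(sendcounts, recvcounts))
        if sent != expected
    ]

# From the log: every rank sends 110592 MPI_INT items, and the root
# expects 442368 items from rank 0.
# ASSUMPTION: the root's entries for ranks 1..7 are 110592, inferred from
# "No problem found in the 7 processes with local ranks [1:7]".
sendcounts = [110592] * 8
recvcounts = [442368] + [110592] * 7

mismatches = find_size_mismatches(sendcounts, recvcounts)
for rank, expected, sent in mismatches:
    print(f"rank {rank}: root expects {expected} items but {sent} sent")
```

Note that 442368 is exactly 4 x 110592; the log alone does not say why the root's entry is four times larger.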

Kirill_V_Intel · ‎09-07-2020

Hello Marcos,

Just a quick question while I'm looking for a PSXE installation at my disposal: do you see any failures when you don't use the Trace Analyzer and Collector?

Thanks,
Kirill

Marcos_V_1 · ‎09-08-2020

Good morning Kirill, thank you for looking into this. I actually also see the error when only invoking the -check_mpi linking flag at compile time, without sourcing psxevars.sh.

So, just compiling and running css_test you should be able to reproduce the error.

Thank you for your time. Best,

Marcos

Marcos_V_1 · ‎09-08-2020

Sorry, what I meant is running css_test, compiled with -check_mpi and with psxevars.sh sourced, in a terminal where psxevars.sh has not been sourced. It is probably the same situation as having sourced psxevars.sh.

In order to be able to compile with -check_mpi you need to source psxevars.sh. Without the flag the code runs.

Gennady_F_Intel · ‎09-08-2020

Compiling and running your example without -check_mpi, I see no problems on my end:

Starting Program ...

MPI Process 0 started on cerberos
MPI Process 1 started on cerberos
MPI Process 2 started on cerberos
MPI Process 6 started on cerberos
MPI Process 7 started on cerberos
MPI Process 3 started on cerberos
MPI Process 4 started on cerberos
MPI Process 5 started on cerberos
Into factorization Phase..
Into solve Phase..
NSOLVES = 100
NSOLVES = 200
NSOLVES = 300
NSOLVES = 400
NSOLVES = 500
NSOLVES = 600
NSOLVES = 700
NSOLVES = 800
NSOLVES = 900
NSOLVES = 1000
NSOLVES = 1100
NSOLVES = 1200
NSOLVES = 1300
NSOLVES = 1400
NSOLVES = 1500
NSOLVES = 1600
NSOLVES = 1700
NSOLVES = 1800
NSOLVES = 1900
NSOLVES = 2000
NSOLVES = 2100
NSOLVES = 2200
NSOLVES = 2300
NSOLVES = 2400

......

Marcos_V_1 · ‎09-08-2020

Hi Gennady, correct. The error appears when compiling with the -check_mpi flag (after sourcing psxevars.sh).

Kirill_V_Intel · ‎09-08-2020

Hi all,

I confirm the issue. The test fails when it is run with -check_mpi, as Marcos described (I believe the Trace Analyzer and Collector forces the stop). The reported size mismatch needs to be investigated.

Best,
Kirill

Gennady_F_Intel · ‎09-09-2020

The issue has been escalated, and this thread will be kept updated.

Kirill_V_Intel · ‎09-09-2020

Hello Marcos,

The root cause is a bug in how the distributed CSR matrix is assembled inside the cluster sparse solver. We'll fix it.

Meanwhile, I have the following workaround for you to try if you have time:

1) Assemble the input matrix (and also the solution and rhs vectors) on the root (main MPI process) so that iparm(40) = 0 can be used.

2) Distribute the matrix across MPI processes with intersections (so that some processes have rows in common), meaning that the ranges [iparm(41); iparm(42)) will intersect across MPI processes.

I am not 100% sure, as I haven't checked them yet, but I believe either of these two should solve the problem. I'd try the first one.

I hope this helps.

Best,
Kirill
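For option 2 above, here is a rough sketch of how overlapping row ranges could be computed per rank. It assumes that iparm(41) and iparm(42) hold the 1-based first and last row of each process's domain, both inclusive; this should be checked against the MKL documentation for the release in use. The names row_ranges and overlap are hypothetical. How the entries of shared rows must be divided between the two processes that hold them is not addressed here.

```python
# Pure-Python sketch: contiguous row ranges per MPI rank, optionally
# extended so that neighbouring ranks share rows.
# ASSUMPTION: ranges are 1-based and inclusive (first_row, last_row).

def row_ranges(n_rows, n_procs, overlap=0):
    """Split rows 1..n_rows into n_procs contiguous ranges.

    With overlap > 0, each range is extended by `overlap` rows into the
    next rank's block (clipped at n_rows), so neighbours intersect.
    """
    if n_procs < 1 or n_rows < n_procs or overlap < 0:
        raise ValueError("need n_rows >= n_procs >= 1 and overlap >= 0")
    base, extra = divmod(n_rows, n_procs)
    ranges = []
    first = 1
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)  # spread the remainder
        last = first + size - 1
        ranges.append((first, min(last + overlap, n_rows)))
        first = last + 1
    return ranges

# Hypothetical example: 16 rows on 4 ranks, without and with overlap.
disjoint = row_ranges(16, 4)
shared = row_ranges(16, 4, overlap=1)
for rank, (lo, hi) in enumerate(shared):
    print(f"rank {rank}: iparm(41) = {lo}, iparm(42) = {hi}")
```

With overlap=1, every pair of neighbouring ranks has exactly one row in common, which is the kind of intersection the workaround describes.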

Marcos_V_1 · ‎09-10-2020

Good Morning Kirill,

Great to see the root cause of the error has been found. For us it doesn't make much sense to build the global Poisson matrix on process 0, as it has no information about the meshes held by other processes.

We will have to wait for the fix and new release of MKL. Thank you very much for your time and attention.

Best,

Marcos

Kirill_V_Intel · ‎09-10-2020

Hi Marcos,

I totally understand that it can be unnatural from the perspective of assembling pieces of the discretization. What I suggest is to write a small piece of code that organizes the MPI communication between processes to form the matrix on the MPI root process.

I guess we can provide such a snippet from our side if needed (this would need communication outside of this forum). It would take the local CSR matrix on each process and assemble the global matrix on the root via MPI.

The rationale of this suggestion is to make it possible for you to not wait on the next release.

Let us know if you think it will help you proceed with your project faster.

Thanks,
Kirill
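As an illustration of the bookkeeping involved (not Intel's snippet, which is not shown in this thread), here is a pure-Python sketch that merges per-rank CSR blocks into one global CSR matrix. It assumes each rank owns a contiguous, non-overlapping block of rows, that blocks arrive in rank order, that column indices are already global, and that arrays are 0-based; a Fortran caller using 1-based arrays would need to shift the indices. The function name assemble_global_csr is hypothetical. In a real code the list of blocks would be what the root receives from the other ranks, for example via mpi4py's comm.gather(block, root=0) or a set of MPI_Gatherv calls.

```python
# Pure-Python sketch (no MPI): merge per-rank CSR row blocks on the root.

def assemble_global_csr(local_blocks):
    """local_blocks: list ordered by rank of (row_ptr, col_idx, values),
    each a 0-based CSR block with global column indices.
    Returns the global (row_ptr, col_idx, values)."""
    g_row_ptr, g_col, g_val = [0], [], []
    for row_ptr, col_idx, values in local_blocks:
        if (row_ptr[0] != 0 or row_ptr[-1] != len(col_idx)
                or len(col_idx) != len(values)):
            raise ValueError("inconsistent local CSR block")
        offset = g_row_ptr[-1]  # nonzeros contributed by earlier ranks
        g_row_ptr.extend(offset + p for p in row_ptr[1:])
        g_col.extend(col_idx)
        g_val.extend(values)
    return g_row_ptr, g_col, g_val

# Hypothetical example: 4x4 tridiagonal (2, -1) matrix split over 2 ranks.
rank0 = ([0, 2, 5], [0, 1, 0, 1, 2], [2.0, -1.0, -1.0, 2.0, -1.0])  # rows 0-1
rank1 = ([0, 3, 5], [1, 2, 3, 2, 3], [-1.0, 2.0, -1.0, -1.0, 2.0])  # rows 2-3

row_ptr, col_idx, values = assemble_global_csr([rank0, rank1])
print(row_ptr)
print(col_idx)
```

The only non-trivial step is shifting each block's row pointers by the number of nonzeros already collected; the column and value arrays are simply concatenated.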

Marcos_V_1 · ‎09-11-2020

Hi Kirill, thank you very much for the offer. I would not worry about this, even though I would personally be interested to see how the communication is set up to send the matrices back to process 0.

I think we can wait for the next MKL release, noting that when testing with -check_mpi we should not use the cluster solver (we have another, non-MKL Poisson solver based on Fishpack, which is the default). This is a new flag we are using as we learn to use the tracer tool, but it is not yet set in the targets compiled in our nightly builds/continuous integration.

Again thank you, and best regards

Marcos

Marcos_V_1 · ‎11-19-2020

Dear Kirill and Gennady, do you know if there have been any updates on this issue?

Thank you,

Marcos

Kirill_V_Intel · ‎11-19-2020

Hi Marcos!

The fix should become available in oneMKL 2021 Gold release which is going to be released soon AFAIK.

Best,
Kirill

Kirill_V_Intel · ‎11-26-2020

A correction to my previous reply: the fix is already available in MKL 2020u4 (and will also be a part of oneMKL 2021.1, that part was correct).

Gennady_F_Intel · ‎12-02-2020

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Marcos_V_1 · ‎12-02-2020

Thank you Gennady and Kirill.

Have a great day,

Marcos

Marcos_V_1 · ‎12-02-2020

Hi Gennady, I'm seeing another issue. If you run the posted self-contained program compiled with the -check_mpi flag and Update 4, it completes the numerical factorization successfully, but after 1500 solves the program crashes with a PMPI_Comm_free() error; see below:

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20

Starting Program ...

MPI Process 0 started on blaze002.backend
MPI Process 1 started on blaze002.backend
MPI Process 2 started on blaze002.backend
MPI Process 3 started on blaze002.backend
MPI Process 4 started on blaze002.backend
MPI Process 5 started on blaze002.backend
MPI Process 6 started on blaze002.backend
MPI Process 7 started on blaze002.backend
Into factorization Phase..
Into solve Phase..
NSOLVES = 100
NSOLVES = 200
NSOLVES = 300
NSOLVES = 400
NSOLVES = 500
NSOLVES = 600
NSOLVES = 700
NSOLVES = 800
NSOLVES = 900
NSOLVES = 1000
NSOLVES = 1100
NSOLVES = 1200
NSOLVES = 1300
NSOLVES = 1400
NSOLVES = 1500
[6] ERROR: Unexpected MPI error, aborting:
[6] ERROR: Invalid communicator, error stack:
[6] ERROR: PMPI_Comm_free(137): MPI_Comm_free(comm=0xa343e90) failed
[6] ERROR: PMPI_Comm_free(85).: Null communicator
[7] ERROR: Unexpected MPI error, aborting:
[7] ERROR: Invalid communicator, error stack:
[7] ERROR: PMPI_Comm_free(137): MPI_Comm_free(comm=0x9dd1e20) failed
[7] ERROR: PMPI_Comm_free(85).: Null communicator
Abort(1) on node 7 (rank 7 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7

Could you see if you can reproduce this new issue on your side? This is a Linux machine (cluster), as described in the post.

Thank you,

Marcos

Gennady_F_Intel · ‎12-02-2020

OK, we will check ASAP.

Gennady_F_Intel · ‎12-02-2020

I see no issues with MKL 2020 u4. I see > 20000 steps were done successfully and I stopped the execution.

[gfedorov@cerberos test]$ mpirun -n 8 ./css_test

Starting Program ...

MPI Process 0 started on cerberos
....
MPI Process 7 started on cerberos
Into factorization Phase..
OMP: Info #274: omp_get_nested routine deprecated, please use omp_get_max_active_levels instead.
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Into solve Phase..
NSOLVES = 100
NSOLVES = 200
NSOLVES = 300
NSOLVES = 400
NSOLVES = 500
……………………
…………………….
NSOLVES = 20700
NSOLVES = 20800
NSOLVES = 20900
[mpiexec@cerberos] Sending Ctrl-C to processes as requested

Gennady_F_Intel · ‎12-02-2020

mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2018 Build 20170713 (id: 17594)

Which MPI version do you use?