Beginner

MKL: Error running cluster_sparse_solver with the -check_mpi flag and tracer in Linux, PS XE 2020.

Dear Gennady and Kirill,

We've come across an error while trying to use the tracer tool to debug the MPI section of our code with the -check_mpi linking flag. The error happens within the first call to cluster_sparse_solver (symbolic factorization): we get a collective SIZE mismatch error in a call to MPI_Gatherv from MKLMPI_Gatherv. We have also noted this in our main source code (FDS) on Linux, likewise using IMPI and Parallel Studio XE 2020 u1.

To verify the finding, I used our demonstration code, which solves an 8-MPI-process Poisson problem with cluster_sparse_solver. Use the attached tarball and follow the instructions in the README:

1. Type: $ source /opt/intel20/parallel_studio_xe_2020/psxevars.sh

2. Make a test/ directory at the same level as the extracted source/ directory.

3. In source/, execute make_test.sh to compile.

4. In test/, run the css_test program with 8 MPI processes.

Any help on why this is coming up would be greatly appreciated.

Thank you for your time and attention.

Marcos

 

PS: Here is the standard error output:

[~test]$ mpirun -n 8 ./css_test

MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20


Starting Program ...

MPI Process 0 started on blaze.el.nist.gov
MPI Process 1 started on blaze.el.nist.gov
MPI Process 2 started on blaze.el.nist.gov
MPI Process 3 started on blaze.el.nist.gov
MPI Process 4 started on blaze.el.nist.gov
MPI Process 5 started on blaze.el.nist.gov
MPI Process 6 started on blaze.el.nist.gov
MPI Process 7 started on blaze.el.nist.gov
Into factorization Phase..

[0] ERROR: GLOBAL:COLLECTIVE:SIZE_MISMATCH: error
[0] ERROR: Mismatch found in local rank [0] (global rank [0]),
[0] ERROR: other processes may also be affected.
[0] ERROR: Root expects 442368 items but 110592 sent by local rank [0] (same as global rank):
[0] ERROR: MPI_Gatherv(*sendbuf=0x2b6882aac240, sendcount=110592, sendtype=MPI_INT, *recvbuf=0x2b6882f64080, *recvcounts=0xa4f5c80, *displs=0xa4f5d00, recvtype=MPI_INT, root=0, comm=0xffffffffc4000000 SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: No problem found in the 7 processes with local ranks [1:7] (same as global ranks):
[0] ERROR: MPI_Gatherv(*sendbuf=..., sendcount=110592, sendtype=MPI_INT, *recvbuf=..., *recvcounts=..., *displs=..., recvtype=MPI_INT, root=0, comm=... SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] INFO: 1 error, limit CHECK-MAX-ERRORS reached => aborting
[0] WARNING: starting premature shutdown

[0] INFO: GLOBAL:COLLECTIVE:SIZE_MISMATCH: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.

....

.....

Employee

Hello Marcos,

Just a quick question while I'm looking for the PSXE at my disposal: do you see any failures when you don't use the Trace Analyzer and Collector?

Thanks,
Kirill

Beginner

Morning Kirill, thank you for looking into this. I actually also see the error when only invoking the -check_mpi linking flag at compile time, without sourcing psxevars.sh.

So, just compiling and running css_test you should be able to reproduce the error.

Thank you for your time, best

 

Marcos

Beginner

Sorry, what I meant by this is running the css_test binary (compiled with -check_mpi and psxevars.sh sourced) in a terminal where psxevars.sh has not been sourced. It is probably the same situation as having sourced psxevars.sh.

In order to compile with -check_mpi you need to source psxevars.sh. Without the flag, the code runs.

Moderator

Compiling and running your example without -check_mpi, I see no problems on my end:

Starting Program ...

MPI Process 0 started on cerberos
MPI Process 1 started on cerberos
MPI Process 2 started on cerberos
MPI Process 6 started on cerberos
MPI Process 7 started on cerberos
MPI Process 3 started on cerberos
MPI Process 4 started on cerberos
MPI Process 5 started on cerberos
Into factorization Phase..
Into solve Phase..
NSOLVES = 100
NSOLVES = 200
NSOLVES = 300
NSOLVES = 400
NSOLVES = 500
NSOLVES = 600
NSOLVES = 700
NSOLVES = 800
NSOLVES = 900
NSOLVES = 1000
NSOLVES = 1100
NSOLVES = 1200
NSOLVES = 1300
NSOLVES = 1400
NSOLVES = 1500
NSOLVES = 1600
NSOLVES = 1700
NSOLVES = 1800
NSOLVES = 1900
NSOLVES = 2000
NSOLVES = 2100
NSOLVES = 2200
NSOLVES = 2300
NSOLVES = 2400

......

Beginner

Hi Gennady, correct. The error comes when compiling with the -check_mpi flag (having previously sourced psxevars.sh).

 

 

Employee

Hi all,

I confirm the issue. The test fails when it is run with -check_mpi as Marcos described (I believe the Trace Analyzer and Collector forces the stop). The reported size mismatch needs to be investigated.

Best,
Kirill

Moderator

The issue has been escalated, and this thread will be kept updated.


Employee

Hello Marcos,

The root cause is a bug in how the distributed CSR matrix is assembled inside the cluster sparse solver. We'll fix it. 

Meanwhile, I have the following workaround for you to try if you have time:

1) Assemble the input matrix (and also the solution and RHS vectors) on the root (main MPI process) so that iparm(40) = 0 can be used.

2) Distribute the matrix across MPI processes with intersections (so that some processes have rows in common), meaning that the ranges [iparm(41); iparm(42)) will have an intersection across MPI processes.

I am not 100% sure, as I haven't checked them yet, but I believe either of these two should solve the problem. I'd try the first one.
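
For reference, below is a minimal Fortran sketch of option 1), i.e. centralized input with iparm(40) = 0. It is only an illustration: the tiny diagonal test system, the program name, and the iparm subset are assumptions, not the Poisson setup from main.f90.

! Sketch of workaround 1): the full CSR matrix, RHS and solution live on rank 0 only.
program css_centralized_sketch
  implicit none
  include 'mkl_cluster_sparse_solver.fi'
  include 'mpif.h'
  type(MKL_CLUSTER_SPARSE_SOLVER_HANDLE) :: pt(64)   ! internal solver handle
  integer, parameter :: n = 5, nrhs = 1
  integer :: iparm(64), idum(1), ia(n+1), ja(n)
  integer :: maxfct, mnum, mtype, phase, msglvl, error, ierr, rank, i
  real(8) :: a(n), b(n), x(n), ddum(1)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  do i = 1, 64
     pt(i)%dummy = 0
     iparm(i) = 0
  end do
  iparm(1)  = 1    ! do not use solver defaults
  iparm(2)  = 2    ! nested-dissection reordering
  iparm(10) = 13   ! pivot perturbation 1e-13
  iparm(40) = 0    ! centralized input: matrix, b and x are significant on rank 0 only

  if (rank == 0) then            ! assemble the (illustrative) global system on the root
     do i = 1, n
        ia(i) = i; ja(i) = i; a(i) = 2.0d0; b(i) = 1.0d0
     end do
     ia(n+1) = n + 1
  end if

  maxfct = 1; mnum = 1; mtype = 11; msglvl = 1
  phase = 13                     ! analysis + factorization + solve
  call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
       idum, nrhs, iparm, msglvl, b, x, MPI_COMM_WORLD, error)
  if (rank == 0) print *, 'error =', error, '  x(1) =', x(1)

  phase = -1                     ! release internal memory
  call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, ddum, idum, idum, &
       idum, nrhs, iparm, msglvl, ddum, ddum, MPI_COMM_WORLD, error)
  call MPI_Finalize(ierr)
end program css_centralized_sketch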

I hope this helps.

Best,
Kirill

Beginner

Good Morning Kirill,

Great to see the root cause of the error has been found. For us it doesn't make much sense to build the global Poisson matrix on process 0, as it doesn't have information about the meshes held by the other processes.

We will have to wait for the fix and new release of MKL. Thank you very much for your time and attention.

Best,

Marcos

Employee

Hi Marcos,

I totally understand that it can feel unnatural from the perspective of assembling the pieces of the discretization. What I suggest is writing a small code that organizes MPI communication between processes to form the matrix on the MPI root process.

I guess we can provide such a snippet from our side if needed (this would need communication outside of this forum). It would take the local CSR matrix on each process and assemble the global matrix on the root via MPI.
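
As a rough illustration of what such a snippet could look like (this is not the Intel-provided code; it assumes each rank owns a contiguous block of rows, stores its piece as 1-based local CSR with global column indices, and all names such as gather_csr_on_root are made up):

module csr_gather
  use mpi
  implicit none
contains
  ! Gather per-rank CSR blocks (contiguous row ownership assumed) into one
  ! global CSR matrix on rank 0. The outputs are only meaningful on rank 0.
  subroutine gather_csr_on_root(nloc, ia_loc, ja_loc, a_loc, n_g, ia_g, ja_g, a_g)
    integer, intent(in)  :: nloc                       ! rows owned by this rank
    integer, intent(in)  :: ia_loc(nloc+1), ja_loc(*)  ! local CSR, global column indices
    real(8), intent(in)  :: a_loc(*)                   ! local CSR values
    integer, intent(out) :: n_g                        ! global row count (on rank 0)
    integer, allocatable, intent(out) :: ia_g(:), ja_g(:)
    real(8), allocatable, intent(out) :: a_g(:)
    integer :: rank, nprocs, ierr, nnz_loc, i, p
    integer, allocatable :: rows(:), nnzs(:), rdisp(:), nzdisp(:)
    integer, allocatable :: rowlen_loc(:), rowlen_g(:)

    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    nnz_loc = ia_loc(nloc+1) - 1

    ! Row and nonzero counts of every rank define the Gatherv layouts.
    allocate(rows(nprocs), nnzs(nprocs), rdisp(nprocs), nzdisp(nprocs))
    call MPI_Gather(nloc,    1, MPI_INTEGER, rows, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
    call MPI_Gather(nnz_loc, 1, MPI_INTEGER, nnzs, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

    n_g = 0
    if (rank == 0) then
       n_g = sum(rows)
       rdisp(1) = 0; nzdisp(1) = 0
       do p = 2, nprocs
          rdisp(p)  = rdisp(p-1)  + rows(p-1)
          nzdisp(p) = nzdisp(p-1) + nnzs(p-1)
       end do
       allocate(ia_g(n_g+1), ja_g(sum(nnzs)), a_g(sum(nnzs)), rowlen_g(n_g))
    else
       allocate(ia_g(1), ja_g(1), a_g(1), rowlen_g(1))   ! placeholders off-root
    end if

    ! Ship row lengths (not row pointers) so rank 0 can rebuild the offsets.
    allocate(rowlen_loc(nloc))
    do i = 1, nloc
       rowlen_loc(i) = ia_loc(i+1) - ia_loc(i)
    end do

    call MPI_Gatherv(rowlen_loc, nloc, MPI_INTEGER, rowlen_g, rows, rdisp, &
                     MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
    call MPI_Gatherv(ja_loc, nnz_loc, MPI_INTEGER, ja_g, nnzs, nzdisp, &
                     MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
    call MPI_Gatherv(a_loc, nnz_loc, MPI_DOUBLE_PRECISION, a_g, nnzs, nzdisp, &
                     MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

    if (rank == 0) then          ! prefix-sum row lengths into global row pointers
       ia_g(1) = 1
       do i = 1, n_g
          ia_g(i+1) = ia_g(i) + rowlen_g(i)
       end do
    end if
    deallocate(rows, nnzs, rdisp, nzdisp, rowlen_loc, rowlen_g)
  end subroutine gather_csr_on_root
end module csr_gather

On rank 0, the gathered (n_g, ia_g, ja_g, a_g) could then be handed to cluster_sparse_solver together with the gathered RHS and iparm(40) = 0, as in the sketch above.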

The rationale for this suggestion is to make it possible for you not to wait for the next release.

Let us know if you think it will help you proceed with your project faster.

Thanks,
Kirill

Beginner

Hi Kirill, thank you very much for the offer. I would not worry about this, even though personally it would be interesting to see how the communication is set up to send the matrices back to rank 0.

I think we can wait for the next MKL release, noting that when doing tests with -check_mpi we will not use the cluster solver (we have another, non-MKL Poisson solver based on Fishpack, which is the default). This is a new flag we are using as we learn the tracer tool, but it is not yet set in the targets compiled in our nightly builds/continuous integration.

Again thank you, and best regards

Marcos

 
