Dear Gennady and Kirill,
We've come across an error when trying to use the tracer tool to debug the MPI section of our code with the -check_mpi linking flag. The error happens within the first call to cluster_sparse_solver (symbolic factorization). We get a collective SIZE mismatch error in a call to MPI_Gatherv from MKLMPI_Gatherv. We see the same behavior in our main source code (FDS) on Linux, also using Intel MPI and Parallel Studio XE 2020 Update 1.
To verify the finding, I used our demonstration code, which solves an 8 MPI process Poisson problem with cluster_sparse_solver. Use the attached tarball and follow the instructions in the README:
1. Type: $ source /opt/intel20/parallel_studio_xe_2020/psxevars.sh
2. Make a test/ directory at the same level as the extracted source/ directory.
3. In source/, execute make_test.sh to compile.
4. In test/, run the css_test program with 8 MPI processes.
Any help on why this is coming up would be greatly appreciated.
Thank you for your time and attention.
Marcos
PS: Here is the std error:
[~test]$ mpirun -n 8 ./css_test
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20
Starting Program ...
MPI Process 0 started on blaze.el.nist.gov
MPI Process 1 started on blaze.el.nist.gov
MPI Process 2 started on blaze.el.nist.gov
MPI Process 3 started on blaze.el.nist.gov
MPI Process 4 started on blaze.el.nist.gov
MPI Process 5 started on blaze.el.nist.gov
MPI Process 6 started on blaze.el.nist.gov
MPI Process 7 started on blaze.el.nist.gov
Into factorization Phase..
[0] ERROR: GLOBAL:COLLECTIVE:SIZE_MISMATCH: error
[0] ERROR: Mismatch found in local rank [0] (global rank [0]),
[0] ERROR: other processes may also be affected.
[0] ERROR: Root expects 442368 items but 110592 sent by local rank [0] (same as global rank):
[0] ERROR: MPI_Gatherv(*sendbuf=0x2b6882aac240, sendcount=110592, sendtype=MPI_INT, *recvbuf=0x2b6882f64080, *recvcounts=0xa4f5c80, *displs=0xa4f5d00, recvtype=MPI_INT, root=0, comm=0xffffffffc4000000 SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: No problem found in the 7 processes with local ranks [1:7] (same as global ranks):
[0] ERROR: MPI_Gatherv(*sendbuf=..., sendcount=110592, sendtype=MPI_INT, *recvbuf=..., *recvcounts=..., *displs=..., recvtype=MPI_INT, root=0, comm=... SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] INFO: 1 error, limit CHECK-MAX-ERRORS reached => aborting
[0] WARNING: starting premature shutdown
[0] INFO: GLOBAL:COLLECTIVE:SIZE_MISMATCH: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.
....
.....
Hello Marcos,
Just a quick question while I'm looking for the PSXE at my disposal: do you see any failures when you don't use the Trace Analyzer and Collector?
Thanks,
Kirill
Morning Kirill, thank you for looking into this. I actually also see the error when just invoking the -check_mpi linking flag at compile time, without sourcing psxevars.sh.
So, just by compiling and running css_test you should be able to reproduce the error.
Thank you for your time, best
Marcos
Sorry, what I meant by this is running css_test (compiled with -check_mpi and psxevars.sh sourced) in a terminal where psxevars.sh has not been sourced. It is probably the same situation as having sourced psxevars.sh.
In order to compile with -check_mpi you need to source psxevars.sh. Without the flag, the code runs fine.
Compiling and running your example without -check_mpi, I see no problems on my end:
Starting Program ...
MPI Process 0 started on cerberos
MPI Process 1 started on cerberos
MPI Process 2 started on cerberos
MPI Process 6 started on cerberos
MPI Process 7 started on cerberos
MPI Process 3 started on cerberos
MPI Process 4 started on cerberos
MPI Process 5 started on cerberos
Into factorization Phase..
Into solve Phase..
NSOLVES = 100
NSOLVES = 200
NSOLVES = 300
NSOLVES = 400
NSOLVES = 500
NSOLVES = 600
NSOLVES = 700
NSOLVES = 800
NSOLVES = 900
NSOLVES = 1000
NSOLVES = 1100
NSOLVES = 1200
NSOLVES = 1300
NSOLVES = 1400
NSOLVES = 1500
NSOLVES = 1600
NSOLVES = 1700
NSOLVES = 1800
NSOLVES = 1900
NSOLVES = 2000
NSOLVES = 2100
NSOLVES = 2200
NSOLVES = 2300
NSOLVES = 2400
......
Hi Gennady, correct. The error comes when compiling with the -check_mpi flag (after sourcing psxevars.sh).
Hi all,
I confirm the issue. The test fails when it is run with -check_mpi as Marcos described (I believe the Trace Analyzer and Collector forces the stop). The reported size mismatch needs to be investigated.
Best,
Kirill
The issue has been escalated and this thread will be kept updated.
Hello Marcos,
The root cause is a bug in how the distributed CSR matrix is assembled inside the cluster sparse solver. We'll fix it.
Meanwhile, I have the following workarounds for you to try if you have time:
1) Assemble the input matrix (and also the solution and RHS vectors) on the root (main MPI process) so that iparm(40) = 0 can be used.
2) Distribute the matrix across MPI processes with intersections (so that some processes have rows in common), meaning that the ranges [iparm(41); iparm(42)) will have an intersection across MPI processes.
I am not 100% sure, as I haven't checked them yet, but I believe either of these two should solve the problem. I'd try the first one (an illustrative sketch follows below).
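As a minimal illustration of option 1 (this is not the attached css_test, just a sketch): the whole matrix, RHS and solution are kept on rank 0 and handed to cluster_sparse_solver with iparm(40) = 0. The toy 1D Poisson matrix, the program name and the particular iparm choices are assumptions made only for the example.

program css_centralized_sketch
   implicit none
   include 'mpif.h'
   integer, parameter :: n = 8          ! tiny 1D Poisson matrix, upper triangle in CSR
   integer*8 :: pt(64)                  ! solver handle, must start zeroed
   integer   :: iparm(64), ia(n+1), ja(2*n-1), idum(1)
   real*8    :: a(2*n-1), b(n), x(n), ddum(1)
   integer   :: maxfct, mnum, mtype, phase, nrhs, msglvl, error, ierr, rank, i

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! Build tridiag(-1, 2, -1), upper triangle, 1-based CSR. With iparm(40) = 0
   ! only rank 0 has to provide meaningful data; here every rank fills the same arrays.
   ia(1) = 1
   do i = 1, n - 1
      ja(ia(i))   = i;      a(ia(i))   =  2.d0
      ja(ia(i)+1) = i + 1;  a(ia(i)+1) = -1.d0
      ia(i+1) = ia(i) + 2
   end do
   ja(ia(n)) = n;  a(ia(n)) = 2.d0;  ia(n+1) = ia(n) + 1
   b = 1.d0

   pt = 0
   iparm = 0
   iparm(1)  = 1        ! do not use all solver defaults
   iparm(2)  = 2        ! METIS fill-in reordering
   iparm(8)  = 2        ! max iterative refinement steps
   iparm(10) = 13       ! pivot perturbation 1e-13
   iparm(40) = 0        ! centralized input: matrix, RHS and solution live on rank 0
   maxfct = 1;  mnum = 1;  mtype = 2    ! real symmetric positive definite
   nrhs = 1;    msglvl = 0

   phase = 13           ! analysis + factorization + solve in one call
   call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
                              idum, nrhs, iparm, msglvl, b, x, MPI_COMM_WORLD, error)
   if (rank == 0) print *, 'error =', error, '  x(1) =', x(1)

   phase = -1           ! release internal memory
   call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, ddum, idum, idum, &
                              idum, nrhs, iparm, msglvl, ddum, ddum, MPI_COMM_WORLD, error)
   call MPI_Finalize(ierr)
end program css_centralized_sketch

It could be built with something like mpiifort css_centralized_sketch.f90 -mkl=cluster; the exact flags depend on the local setup.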
I hope this helps.
Best,
Kirill
Good Morning Kirill,
Great to see that the root cause of the error has been found. For us it doesn't make much sense to build the global Poisson matrix on process 0, as it doesn't have the information for the meshes held by other processes.
We will have to wait for the fix and new release of MKL. Thank you very much for your time and attention.
Best,
Marcos
Hi Marcos,
I totally understand that it can be unnatural from the perspective of assembling the pieces of the discretization. What I suggest is to write a small code that organizes the MPI communication between processes to form the matrix on the MPI root process.
I guess we can provide such a snippet from our side if needed (this would need communication outside of this forum). It would take the local CSR matrix on each process and assemble the global matrix on the root via MPI (a rough sketch of the idea is shown below).
The rationale of this suggestion is to make it possible for you not to wait for the next release.
Let us know if you think it will help you proceed with your project faster.
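As a rough illustration of the idea only (not the snippet offered above), and under the assumption that each rank holds a contiguous block of rows in local 1-based CSR with global column indices, ordered by rank, a hypothetical helper like gather_csr_on_root below could collect the pieces on rank 0 with MPI_Gatherv; the module and subroutine names are made up for the example.

module csr_gather_sketch
contains
   subroutine gather_csr_on_root(nrows_loc, ia_loc, ja_loc, a_loc, &
                                 n_glob, ia_glob, ja_glob, a_glob, comm)
      implicit none
      include 'mpif.h'
      integer, intent(in)  :: nrows_loc, comm
      integer, intent(in)  :: ia_loc(nrows_loc+1), ja_loc(*)
      real*8,  intent(in)  :: a_loc(*)
      integer, intent(out) :: n_glob
      integer, allocatable, intent(out) :: ia_glob(:), ja_glob(:)
      real*8,  allocatable, intent(out) :: a_glob(:)
      integer :: rank, nprocs, ierr, nnz_loc, nnz_glob, i, p
      integer, allocatable :: nrows_all(:), nnz_all(:), rdisp(:), ndisp(:)
      integer, allocatable :: rowcnt_loc(:), rowcnt_glob(:)

      call MPI_Comm_rank(comm, rank, ierr)
      call MPI_Comm_size(comm, nprocs, ierr)
      nnz_loc = ia_loc(nrows_loc+1) - 1

      ! 1) rank 0 learns how many rows / nonzeros each rank contributes
      allocate(nrows_all(nprocs), nnz_all(nprocs), rdisp(nprocs), ndisp(nprocs))
      call MPI_Gather(nrows_loc, 1, MPI_INTEGER, nrows_all, 1, MPI_INTEGER, 0, comm, ierr)
      call MPI_Gather(nnz_loc,   1, MPI_INTEGER, nnz_all,   1, MPI_INTEGER, 0, comm, ierr)

      n_glob = 0;  nnz_glob = 0
      if (rank == 0) then
         rdisp(1) = 0;  ndisp(1) = 0
         do p = 2, nprocs
            rdisp(p) = rdisp(p-1) + nrows_all(p-1)
            ndisp(p) = ndisp(p-1) + nnz_all(p-1)
         end do
         n_glob   = sum(nrows_all)
         nnz_glob = sum(nnz_all)
      end if
      allocate(ia_glob(n_glob+1), ja_glob(max(nnz_glob,1)), a_glob(max(nnz_glob,1)))
      allocate(rowcnt_loc(nrows_loc), rowcnt_glob(max(n_glob,1)))

      ! 2) gather per-row entry counts and rebuild the global row pointer on rank 0
      do i = 1, nrows_loc
         rowcnt_loc(i) = ia_loc(i+1) - ia_loc(i)
      end do
      call MPI_Gatherv(rowcnt_loc, nrows_loc, MPI_INTEGER, &
                       rowcnt_glob, nrows_all, rdisp, MPI_INTEGER, 0, comm, ierr)
      if (rank == 0) then
         ia_glob(1) = 1
         do i = 1, n_glob
            ia_glob(i+1) = ia_glob(i) + rowcnt_glob(i)
         end do
      end if

      ! 3) gather column indices and values in rank order
      call MPI_Gatherv(ja_loc, nnz_loc, MPI_INTEGER, &
                       ja_glob, nnz_all, ndisp, MPI_INTEGER, 0, comm, ierr)
      call MPI_Gatherv(a_loc,  nnz_loc, MPI_DOUBLE_PRECISION, &
                       a_glob, nnz_all, ndisp, MPI_DOUBLE_PRECISION, 0, comm, ierr)

      deallocate(nrows_all, nnz_all, rdisp, ndisp, rowcnt_loc, rowcnt_glob)
   end subroutine gather_csr_on_root
end module csr_gather_sketch

The assembled ia_glob/ja_glob/a_glob on rank 0 could then be passed to cluster_sparse_solver with iparm(40) = 0, as in option 1 above; again, this is only a sketch of the idea, not Intel's code.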
Thanks,
Kirill
Hi Kirill, thank you very much for the offer. I wouldn't worry about this, even though personally it would be interesting to see how the communication is set up to send the matrices back to process 0.
I think we can wait for the next MKL release, noting that for tests with -check_mpi we just won't use the cluster solver (we have another, non-MKL Poisson solver based on Fishpack, which is the default). This is a new flag we are using as we learn the tracer tool, but it is not yet set in the targets compiled in our nightly builds/continuous integration.
Again thank you, and best regards
Marcos
Dear Kirill and Gennady, do you know if there have been any updates on this issue?
Thank you,
Marcos
Hi Marcos!
The fix should become available in the oneMKL 2021 Gold release, which is going to be released soon, AFAIK.
Best,
Kirill
A correction to my previous reply: the fix is already available in MKL 2020 Update 4 (and will also be part of oneMKL 2021.1; that part was correct).
This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
Thank you Gennady and Kirill.
Have a great day,
Marcos
Hi Gennady, I'm seeing another issue. If you run the posted self-contained program compiled with the -check_mpi flag and Update 4, it goes through the numerical factorization successfully, but after 1500 solves the program crashes with a PMPI_Comm_free() error; see below:
[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20
Starting Program ...
MPI Process 0 started on blaze002.backend
MPI Process 1 started on blaze002.backend
MPI Process 2 started on blaze002.backend
MPI Process 3 started on blaze002.backend
MPI Process 4 started on blaze002.backend
MPI Process 5 started on blaze002.backend
MPI Process 6 started on blaze002.backend
MPI Process 7 started on blaze002.backend
Into factorization Phase..
Into solve Phase..
NSOLVES = 100
NSOLVES = 200
NSOLVES = 300
NSOLVES = 400
NSOLVES = 500
NSOLVES = 600
NSOLVES = 700
NSOLVES = 800
NSOLVES = 900
NSOLVES = 1000
NSOLVES = 1100
NSOLVES = 1200
NSOLVES = 1300
NSOLVES = 1400
NSOLVES = 1500
[6] ERROR: Unexpected MPI error, aborting:
[6] ERROR: Invalid communicator, error stack:
[6] ERROR: PMPI_Comm_free(137): MPI_Comm_free(comm=0xa343e90) failed
[6] ERROR: PMPI_Comm_free(85).: Null communicator
[7] ERROR: Unexpected MPI error, aborting:
[7] ERROR: Invalid communicator, error stack:
[7] ERROR: PMPI_Comm_free(137): MPI_Comm_free(comm=0x9dd1e20) failed
[7] ERROR: PMPI_Comm_free(85).: Null communicator
Abort(1) on node 7 (rank 7 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
Could you see if you can reproduce this new issue on your side? This is a Linux machine (cluster), as described in the post.
Thank you,
Marcos
I see no issues with MKL 2020 u4. More than 20000 steps completed successfully, and then I stopped the execution.
[gfedorov@cerberos test]$ mpirun -n 8 ./css_test
Starting Program ...
MPI Process 0 started on cerberos
....
MPI Process 7 started on cerberos
Into factorization Phase..
OMP: Info #274: omp_get_nested routine deprecated, please use omp_get_max_active_levels instead.
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Into solve Phase..
NSOLVES = 100
NSOLVES = 200
NSOLVES = 300
NSOLVES = 400
NSOLVES = 500
……………………
…………………….
NSOLVES = 20700
NSOLVES = 20800
NSOLVES = 20900
[mpiexec@cerberos] Sending Ctrl-C to processes as requested
mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2018 Build 20170713 (id: 17594)
Which MPI version do you use?