Dear Gennady and Kirill,
We've come across an error when trying to use the tracing tool to debug the MPI section of our code, linking with the -check_mpi flag. The error happens within the first call to cluster_sparse_solver (symbolic factorization): a collective SIZE mismatch is reported in a call to MPI_Gatherv issued from MKLMPI_Gatherv. We have also seen this in our main source code (FDS) on Linux, likewise with Intel MPI and Parallel Studio XE 2020 Update 1.
To verify the finding I used our demonstration code, which solves an 8-MPI-process Poisson problem with cluster_sparse_solver. Use the attached tarball and follow the instructions in the README:
1. Type: $ source /opt/intel20/parallel_studio_xe_2020/psxevars.sh
2. Make a test/ directory at the same level as the extracted source/ directory.
3. In source/, execute make_test.sh to compile.
4. In test/, run the css_test program with 8 MPI processes.
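For readers without the tarball, the solver call in question looks roughly like the sketch below. This is a minimal, hypothetical example (a toy 1-D Poisson matrix with centralized CSR input, default iparm, and made-up names), not the attached demonstration code; it only illustrates the analysis call (phase 11), which is the step where the mismatch was reported.

program css_analysis_sketch
  use mpi
  implicit none
  integer, parameter :: n = 8, nnz = 3*n - 2
  integer(8) :: pt(64)
  integer    :: iparm(64), perm(n), ia(n+1), ja(nnz)
  real(8)    :: a(nnz), b(n), x(n)
  integer    :: maxfct, mnum, mtype, phase, nrhs, msglvl, error
  integer    :: ierr, rank, i, k

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Assemble a small 1-D Poisson (tridiagonal) matrix in CSR format.
  ! With default iparm the input is centralized, so only rank 0 really needs it.
  k = 0
  ia(1) = 1
  do i = 1, n
     if (i > 1) then
        k = k + 1; ja(k) = i - 1; a(k) = -1.0d0
     end if
     k = k + 1; ja(k) = i; a(k) = 2.0d0
     if (i < n) then
        k = k + 1; ja(k) = i + 1; a(k) = -1.0d0
     end if
     ia(i+1) = k + 1
  end do

  pt = 0; iparm = 0; perm = 0      ! iparm(1)=0: use solver defaults
  maxfct = 1; mnum = 1
  mtype  = 11                      ! real, non-symmetric
  nrhs   = 1
  msglvl = 1
  b = 1.0d0; x = 0.0d0

  phase = 11                       ! analysis / symbolic factorization (the phase flagged in our run)
  call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
                             nrhs, iparm, msglvl, b, x, MPI_COMM_WORLD, error)
  if (rank == 0) print *, 'analysis phase returned error =', error

  phase = -1                       ! release internal solver memory
  call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
                             nrhs, iparm, msglvl, b, x, MPI_COMM_WORLD, error)

  call MPI_Finalize(ierr)
end program css_analysis_sketch

Something like this can be built with mpiifort plus the MKL cluster libraries (e.g. -mkl=cluster) and linked with -check_mpi to enable the message checking, similar to what make_test.sh does.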
Any help on why this is coming up would be greatly appreciated.
Thank you for your time and attention.
Marcos
PS: Here is the std error:
[~test]$ mpirun -n 8 ./css_test
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20
Starting Program ...
MPI Process 0 started on blaze.el.nist.gov
MPI Process 1 started on blaze.el.nist.gov
MPI Process 2 started on blaze.el.nist.gov
MPI Process 3 started on blaze.el.nist.gov
MPI Process 4 started on blaze.el.nist.gov
MPI Process 5 started on blaze.el.nist.gov
MPI Process 6 started on blaze.el.nist.gov
MPI Process 7 started on blaze.el.nist.gov
Into factorization Phase..
[0] ERROR: GLOBAL:COLLECTIVE:SIZE_MISMATCH: error
[0] ERROR: Mismatch found in local rank [0] (global rank [0]),
[0] ERROR: other processes may also be affected.
[0] ERROR: Root expects 442368 items but 110592 sent by local rank [0] (same as global rank):
[0] ERROR: MPI_Gatherv(*sendbuf=0x2b6882aac240, sendcount=110592, sendtype=MPI_INT, *recvbuf=0x2b6882f64080, *recvcounts=0xa4f5c80, *displs=0xa4f5d00, recvtype=MPI_INT, root=0, comm=0xffffffffc4000000 SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: No problem found in the 7 processes with local ranks [1:7] (same as global ranks):
[0] ERROR: MPI_Gatherv(*sendbuf=..., sendcount=110592, sendtype=MPI_INT, *recvbuf=..., *recvcounts=..., *displs=..., recvtype=MPI_INT, root=0, comm=... SPLIT COMM_WORLD [0:7])
[0] ERROR: MKLMPI_Gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cpardiso_mpi_gatherv (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_assemble_csr_full (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: mkl_pds_lp64_cluster_sparse_solver (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: MAIN__ (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/source/main.f90:269)
[0] ERROR: main (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] ERROR: __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR: (/home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST_CHECKMPI/test/css_test)
[0] INFO: 1 error, limit CHECK-MAX-ERRORS reached => aborting
[0] WARNING: starting premature shutdown
[0] INFO: GLOBAL:COLLECTIVE:SIZE_MISMATCH: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.
....
.....
Good morning Gennady, I understand it is the MPI library that comes with Update 4. Please confirm:
$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2019 Update 9 Build 20200923 (id: abd58e492)
Copyright 2003-2020, Intel Corporation.
$ which mpirun
/opt/intel20/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin/mpirun
Thank you,
Marcos
Hi Marcos,
Yes, I checked the problem with a two-year-old version of MPI. We will check with the current (latest) one and get back to this thread as soon as possible.
Hi Marcos,
Yes, I can confirm that I see exactly the same issue you reported last time, after 1500 steps. We have to investigate the cause of the issue and will keep this thread updated.
-Gennady
Hi Gennady, any updates on this last issue?
Thank you,
Marcos
Hi Marcos,
The issue is unrelated to MKL and, most likely, unrelated to MPI as well. It has been reproduced with a simple code that performs many communicator split and free operations (MPI_Comm_split and MPI_Comm_free). The failure happens inside ITAC (I suspect some internal bookkeeping for the MPI communicators has a size limit that gets exceeded).
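For illustration, a stress loop of that kind can be as simple as the following sketch (a hypothetical example written for this thread, not the actual internal reproducer):

program comm_split_free_stress
  use mpi
  implicit none
  integer :: ierr, rank, newcomm, i

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Repeatedly create and destroy a communicator, similar to what many
  ! consecutive cluster_sparse_solver calls appear to do internally.
  ! Built and linked with -check_mpi, the suspicion is that ITAC's
  ! communicator bookkeeping eventually fails after enough split/free cycles.
  do i = 1, 2000
     call MPI_Comm_split(MPI_COMM_WORLD, 0, rank, newcomm, ierr)
     call MPI_Comm_free(newcomm, ierr)
  end do

  call MPI_Finalize(ierr)
end program comm_split_free_stress

Run under mpirun with several ranks, this kind of loop is consistent with the MPI_Comm_free errors you saw after roughly 1500 solve calls.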
The ITAC team has been informed about the issue; Gennady or I will post an update here once there is a response.
Unfortunately, I cannot suggest a reliable workaround other than temporarily not using ITAC, or making fewer solve calls / using fewer MPI processes (which I suspect would only defer the failure).
Best,
Kirill
Hi Kirill, thank you for the update. We are eager to add the MPI checking option to our development compilation targets and make it part of our software development workflow.
What is ITAC?
Thank you,
Marcos
Marcos,
ITAC stands for Intel Trace Analyzer and Collector; the ITAC team is actually part of the MPI team.
Hi Marcos,
Thanks for your patience. The issue you raised has been fixed in MKL 2020.4. Please download the latest oneAPI release (2021.2) for the latest oneMKL version and let us know your experience with it.
Good morning Prasanth, the original issue with MKL has been resolved. The second issue on the same test case, related to linking the test problem with the -check_mpi flag, still seems to be there (Gennady commented that it is related to MPI, not MKL). I can still reproduce it with oneAPI Update 2.
To see the issue, compile the code adding -check_mpi to the link flags in make_test.sh and run the case as explained. I still get the following:
[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20
Starting Program ...
MPI Process 0 started on burn034
MPI Process 1 started on burn034
MPI Process 2 started on burn034
MPI Process 3 started on burn034
MPI Process 4 started on burn034
MPI Process 5 started on burn034
MPI Process 6 started on burn034
MPI Process 7 started on burn034
Into factorization Phase..
Into solve Phase..
NSOLVES = 100
NSOLVES = 200
NSOLVES = 300
NSOLVES = 400
NSOLVES = 500
NSOLVES = 600
NSOLVES = 700
NSOLVES = 800
NSOLVES = 900
NSOLVES = 1000
NSOLVES = 1100
NSOLVES = 1200
NSOLVES = 1300
NSOLVES = 1400
NSOLVES = 1500
[6] ERROR: Unexpected MPI error, aborting:
[6] ERROR: Invalid communicator, error stack:
[6] ERROR: PMPI_Comm_free(137): MPI_Comm_free(comm=0xa702140) failed
[6] ERROR: PMPI_Comm_free(85).: Null communicator
[7] ERROR: Unexpected MPI error, aborting:
[7] ERROR: Invalid communicator, error stack:
[7] ERROR: PMPI_Comm_free(137): MPI_Comm_free(comm=0xa9c30d0) failed
[7] ERROR: PMPI_Comm_free(85).: Null communicator
Abort(1) on node 6 (rank 6 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
Abort(1) on node 7 (rank 7 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
Thank you,
Marcos
Hi Marcos,
Sorry for the delay in response.
Yes, the MKL issue has been resolved in MKL 2020.4, whereas the fix for the ITAC issue is not yet available; it will ship in the next version, ITAC 2021.3, which I believe will be released as part of oneAPI 2021.3.
Let us know if we can close this thread for now.
Regards
Prasanth
Yes, close the thread. I'll open a new one if there is any issue with the new release. Thanks.
Hi,
Thanks for the confirmation.
As your issue has been resolved, we are closing this thread, and we will no longer respond to it. If you find any issues in the latest version, please start a new thread. Any further interaction in this thread will be considered community support only.
Regards
Prasanth