Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

-check_mpi causes code to get stuck in MPI_FINALIZE

Kevin_McGrattan
6,030 Views

I am using the oneAPI "latest" version of Intel MPI with Fortran on a Linux cluster. Things are working fine. However, to check my MPI calls, I added -check_mpi to my link step and ran a simple case. The MPI checking works, but the program hangs in MPI_FINALIZE. If I compile without -check_mpi, it does not hang in MPI_FINALIZE. With or without -check_mpi, the calculation itself runs fine; it just gets stuck in MPI_FINALIZE with -check_mpi.

I did some searching, and there are numerous posts about calculations getting stuck in MPI_FINALIZE, regardless of -check_mpi. The usual response to these reports is to ensure that all communications have completed. However, in my case, that is exactly what I want the -check_mpi flag to tell me. I don't think there are outstanding communications, but who knows. Is there a way I can force my way out of MPI_FINALIZE, or prompt it to provide a coherent error message?

1 Solution
James_T_Intel
Moderator
5,592 Views

Short version: I_MPI_FABRICS=shm will use the Intel® MPI Library shared memory implementation, while FI_PROVIDER=shm will use the libfabric shared memory implementation.


I_MPI_FABRICS is used to set the communication provider used by Intel® MPI Library. In older versions, this was the primary mechanism for specifying the interconnect. Starting with 2019, this was modified, along with other major internal changes, to run all inter-node communications through libfabric. Now there are three options for I_MPI_FABRICS: shm (shared memory only, valid only for a single-node run), ofi (libfabric only), and shm:ofi (shared memory for intra-node communication, libfabric for inter-node).


FI_PROVIDER sets the provider to be used by libfabric. By choosing shm here, we will still go through libfabric, and libfabric will use its own shared memory communications. See https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/running-applications/fabrics-control/ofi-providers-support.html for our documentation regarding provider selection and https://github.com/ofiwg/libfabric for full details on libfabric.
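
For example, on a single node the two shared-memory paths would be selected roughly like this (a minimal sketch only; the executable name and rank count are placeholders):

# Option 1: Intel MPI Library's own shared-memory path (single node only)
export I_MPI_FABRICS=shm
mpiexec -n 4 ./my_app

# Option 2: run everything through libfabric and use libfabric's shared-memory provider
export I_MPI_FABRICS=ofi
export FI_PROVIDER=shm
mpiexec -n 4 ./my_app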


32 Replies
PrasanthD_intel
Moderator
4,112 Views

Hi Kevin,


Could you please provide the command line you were using to launch MPI?

If it doesn't contain the number of nodes you were launching on, please mention that too.


Regards

Prasanth


Kevin_McGrattan
4,105 Views

If I run the job directly from the command line on the head node

mpiexec -n 1 <executable> <input_file.txt>

the job runs fine. It's just a single process MPI job, in this case, for simplicity.

However, I typically run jobs via a SLURM script:

#!/bin/bash
...
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1

module purge
module load ... tbb/latest compiler-rt/latest dpl/latest mpi/latest psm

module load libfabric/1.10.1

export OMP_NUM_THREADS=1
export I_MPI_DEBUG=5
export FI_PROVIDER=shm

srun -N 1 -n 1 --ntasks-per-node 1 <executable> <input_file.txt>

I wonder if this has to do with the psm libfabric provider, which we use because we have old QLogic InfiniBand cards. Or it could have to do with SLURM, srun, etc.

Kevin_McGrattan
4,098 Views

More info: I ran this same simple case on another Linux cluster that uses Mellanox cards and does not use the psm libfabric provider. The case runs successfully there. So I suspect that this hanging in MPI_FINALIZE is not related to SLURM, but rather to psm. Our QLogic cards are sufficiently old that we had to build the psm library ourselves. Can you think of a reason for hanging in MPI_FINALIZE? Could it be that in this case we are only using intra-node (shm) communications?

PrasanthD_intel
Moderator
4,086 Views

Hi Kevin,


Instead of srun, could you try launching with mpiexec.hydra and check whether the hang still occurs?


Regards

Prasanth



Kevin_McGrattan
4,076 Views

I have discovered that srun and SLURM are not the problem. The problem occurs with the psm libfabric provider that we use on one of our Linux clusters because it has QLogic InfiniBand cards. So basically we are using an old fabric with old cards, and maybe this is just a consequence of that. However, if you can think of a way to force the code to exit MPI_FINALIZE, or of some way to compile and link that would solve the problem, I would appreciate it.

PrasanthD_intel
Moderator
4,047 Views

Hi Kevin,


Sorry for the delay in response. Could you please provide the model name and any additional information regarding your QLogic InfiniBand adapter?


Regards

Prasanth


Kevin_McGrattan
4,040 Views

CentOS 7 Linux using the latest oneAPI Fortran compiler and MPI library

 

$ ibstat
CA 'qib0'
    CA type: InfiniPath_QLE7340
    Number of ports: 1
    Firmware version:
    Hardware version: 2
    Node GUID: 0x00117500006fcc26
    System image GUID: 0x00117500006fcc26
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 2
        LMC: 0
        SM lid: 1
        Capability mask: 0x07690868
        Port GUID: 0x00117500006fcc26
        Link layer: InfiniBand

PrasanthD_intel
Moderator
4,005 Views

Hi Kevin,

 

Thanks for being patient; we are sorry for the delay.

I am escalating this thread to an SME (Subject Matter Expert).

We will get back to you soon.

 

Regards

Prasanth

James_T_Intel
Moderator
3,996 Views

You mentioned that you have confirmed this is related to the QLogic hardware. Can you specify another device you tested where it works?


Please check if you get the same hang using -trace instead of -check_mpi.


Do you see the same hang on a simple Hello World code with -check_mpi on the QLogic hardware?


Kevin_McGrattan
3,988 Views

We have two Linux clusters, both configured more or less the same, except that one uses QLogic/psm (qib0) and the other Mellanox/ofi (mlx4_0). The hang-up in MPI_FINALIZE occurs on the QLogic system. It occurs when I use -check_mpi. It does not occur when I use -trace. I cannot reproduce the problem with a simple Hello_World problem.

Is there a way I can get information from MPI_FINALIZE that might give a hint as to something I am doing that is not appropriate? I do not get any errors or warnings from the -check_mpi option. The calculations finish fine, but the processes are never released and remain running.

James_T_Intel
Moderator
3,970 Views

For most errors, the message checker will print output immediately. If you have requests left open, those are printed at the end.


Can you attach a debugger and identify where the hang occurs?


Also, have you encountered this in an earlier version?


Kevin_McGrattan
3,963 Views

The code enters MPI_FINALIZE and never returns, even with only a single MPI process running. This only happens when I use -check_mpi; if I do not use -check_mpi, everything works properly. However, the point of using -check_mpi is to see if there is a problem with my MPI calls. I haven't encountered this before because I have only now started using the -check_mpi option. In general, this option has identified a few non-kosher MPI calls, which I have fixed. I want to use -check_mpi as part of our routine continuous integration process, but I cannot because the jobs hang in the MPI_FINALIZE call.

So my question to you is this --- is there a time-out parameter that would force the code to exit MPI_FINALIZE and tell me if I have done something non-kosher within the code?

James_T_Intel
Moderator
3,956 Views

There is a VT_DEADLOCK_TIMEOUT option you can set. The default is 1 minute.
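
For example, it could be raised in the job script before the launch command (a sketch only; the time-value suffix format, e.g. "5m", is an assumption on my part, so please check the Intel® Trace Collector documentation for the exact syntax):

export VT_DEADLOCK_TIMEOUT=5m   # assumed time-value syntax; the default timeout is 1 minute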


Kevin_McGrattan
3,952 Views

My calculations remain deadlocked in MPI_FINALIZE indefinitely. The job never ends because it is stuck in the second-to-last line of the code:

CALL MPI_FINALIZE

END PROGRAM

If the code never ends, the cluster cores are never released, and I cannot run a suite of test cases automatically. 

Kevin_McGrattan
3,945 Views

The MPI standard only requires that rank 0 return from MPI_FINALIZE. From version 3.0 of the standard, Chapter 8, Section 8.7:

Although it is not required that all processes return from MPI_FINALIZE, it is required that at least process 0 in MPI_COMM_WORLD return, so that users can know that the MPI portion of the computation is over. In addition, in a POSIX environment, users may desire to supply an exit code for each process that returns from MPI_FINALIZE.

So this is what I need to do -- exit MPI_FINALIZE with some sort of error code.

James_T_Intel
Moderator
3,871 Views

Please run with VT_VERBOSE=5 and attach the output. Also, please get the stack of the hanging process with gstack <pid>.


Kevin_McGrattan
3,848 Views

I have attached the output of the gstack command. The VT_VERBOSE output is extensive, but the bottom line appears to be that I have not freed a datatype. I checked, and the only MPI datatype that I create, I free with MPI_TYPE_FREE.

 

[1 Wed Feb 24 14:52:42 2021] WARNING: LOCAL:DATATYPE:NOT_FREED: warning
[1 Wed Feb 24 14:52:42 2021] WARNING: When calling MPI_Finalize() there were unfreed user-defined datatypes:
[1 Wed Feb 24 14:52:42 2021] WARNING: 1 in this process.
[1 Wed Feb 24 14:52:42 2021] WARNING: This may indicate that resources are leaked at runtime.
[1 Wed Feb 24 14:52:42 2021] WARNING: To clean up properly MPI_Type_free() should be called for
[1 Wed Feb 24 14:52:42 2021] WARNING: all user-defined datatypes.
[1 Wed Feb 24 14:52:42 2021] WARNING: 1. 1 time:
[1 Wed Feb 24 14:52:42 2021] WARNING: mpi_type_create_struct_(count=2, *array_of_blocklens=0x7ffebc055cc0, *array_of_displacements=0x7ffebc055c90, *array_of_types=0x7ffebc055cb0, *newtype=0x7ffebc055b28, *ierr=0xda99900)
[1 Wed Feb 24 14:52:42 2021] WARNING: fds_IP_exchange_diagnostics_ (/home4/mcgratta/firemodels/fds/Build/impi_intel_linux_64_db/../../Source/main.f90:3510)
[1 Wed Feb 24 14:52:42 2021] WARNING: MAIN__ (/home4/mcgratta/firemodels/fds/Build/impi_intel_linux_64_db/../../Source/main.f90:922)
[1 Wed Feb 24 14:52:42 2021] WARNING: main (/home4/mcgratta/firemodels/fds/Build/impi_intel_linux_64_db/fds_impi_intel_linux_64_db)
[1 Wed Feb 24 14:52:42 2021] WARNING: __libc_start_main (/usr/lib64/libc-2.17.so)
[1 Wed Feb 24 14:52:42 2021] WARNING: (/home4/mcgratta/firemodels/fds/Build/impi_intel_linux_64_db/fds_impi_intel_linux_64_db)
[0 Wed Feb 24 14:52:43 2021] INFO: "logging": internal info...
[0 Wed Feb 24 14:52:43 2021] INFO: "logging": communicators...
[1 Wed Feb 24 14:52:43 2021] INFO: "logging": internal info...
[1 Wed Feb 24 14:52:43 2021] INFO: "logging": communicators...
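
For reference, the pattern the checker is looking for is roughly the following (a minimal sketch, not taken from FDS; the names, sizes, and displacements are placeholders, and real code would compute displacements with MPI_GET_ADDRESS). If the routine that creates the type is called more than once, each created type needs its own MPI_TYPE_FREE, otherwise an instance is left unfreed at MPI_Finalize:

! Minimal sketch (not FDS code): every user-defined datatype created with
! MPI_TYPE_CREATE_STRUCT should eventually be released with MPI_TYPE_FREE.
SUBROUTINE EXCHANGE_EXAMPLE
USE MPI
IMPLICIT NONE
INTEGER :: IERR, NEWTYPE
INTEGER, DIMENSION(2) :: BLOCKLENGTHS, TYPES
INTEGER(KIND=MPI_ADDRESS_KIND), DIMENSION(2) :: DISPLACEMENTS

BLOCKLENGTHS  = (/ 1, 1 /)
TYPES         = (/ MPI_INTEGER, MPI_DOUBLE_PRECISION /)
DISPLACEMENTS = (/ 0_MPI_ADDRESS_KIND, 8_MPI_ADDRESS_KIND /)  ! placeholder offsets

CALL MPI_TYPE_CREATE_STRUCT(2, BLOCKLENGTHS, DISPLACEMENTS, TYPES, NEWTYPE, IERR)
CALL MPI_TYPE_COMMIT(NEWTYPE, IERR)

! ... use NEWTYPE in the diagnostic exchange ...

CALL MPI_TYPE_FREE(NEWTYPE, IERR)  ! without this, -check_mpi reports LOCAL:DATATYPE:NOT_FREED
END SUBROUTINE EXCHANGE_EXAMPLE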

James_T_Intel
Moderator
3,842 Views

Please go ahead and attach the output with VT_VERBOSE=1; I will provide both of these to our development team for analysis.


Kevin_McGrattan
3,835 Views

The file I have attached is standard error. It does not contain much info with VT_VERBOSE=1. Is this the file you meant?

James_T_Intel
Moderator
3,800 Views

Sorry, I meant the VT_VERBOSE=5 output.

