I am using the oneAPI "latest" version of Intel MPI with Fortran on a Linux cluster. Everything works fine. However, to check my MPI calls, I added -check_mpi to my link step and ran a simple case. The MPI checking works, but the program hangs in MPI_FINALIZE. If I compile without -check_mpi, it does not hang. With or without -check_mpi, the calculation itself runs fine; it just gets stuck in MPI_FINALIZE when -check_mpi is used.
I did some searching, and there are numerous posts about calculations getting stuck in MPI_FINALIZE, regardless of -check_mpi. The usual response is to ensure that all communications have completed. However, in my case, that is exactly what I want the -check_mpi flag to tell me. I don't believe there are outstanding communications, but who knows. Is there a way I can force my way out of MPI_FINALIZE, or prompt it to provide a coherent error message?
Short version: I_MPI_FABRICS=shm will use the Intel® MPI Library shared memory implementation, FI_PROVIDER=shm will use the libfabric shared memory implementation.
I_MPI_FABRICS sets the communication provider used by the Intel® MPI Library. In older versions, this was the primary mechanism for specifying the interconnect. Starting with the 2019 release, this was modified, along with other major internal changes, to run all inter-node communications through libfabric. There are now three options for I_MPI_FABRICS: shm (shared memory only, valid only for a single-node run), ofi (libfabric only), and shm:ofi (shared memory intranode, libfabric internode).
FI_PROVIDER sets the provider to be used by libfabric. By choosing shm here, we will still go through libfabric, and libfabric will use its own shared memory communications. See https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/run... for our documentation regarding provider selection and https://github.com/ofiwg/libfabric for full details on libfabric.
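To make the distinction above concrete, here is a minimal sketch of the two ways to end up on shared memory, assuming a single-node job:

```shell
# Intel MPI's own shared-memory implementation, bypassing libfabric entirely
# (only valid when all ranks are on one node):
export I_MPI_FABRICS=shm

# Or: go through libfabric, but select libfabric's shared-memory provider:
export I_MPI_FABRICS=ofi
export FI_PROVIDER=shm
```

Set one or the other before launching with mpiexec/mpirun; the two paths exercise different code, which can matter when debugging provider-specific problems.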
Could you please provide the command line you were using to launch MPI?
If it doesn't contain the number of nodes you were launching on, please mention that too.
If I run the job directly from the command line on the head node:
mpiexec -n 1 <executable> <input_file.txt>
the job runs fine. It's just a single process MPI job, in this case, for simplicity.
However, I typically run jobs via a SLURM script:
module load ... tbb/latest compiler-rt/latest dpl/latest mpi/latest psm
module load libfabric/1.10.1
srun -N 1 -n 1 --ntasks-per-node 1 <executable> <input_file.txt>
I wonder if this has to do with the psm libfabric provider, which we use because we have old QLogic InfiniBand cards. Or it could have to do with SLURM, srun, etc.
More info: I ran this same simple case on another Linux cluster that uses Mellanox cards and does not use the psm provider. The case runs successfully there, so I suspect that the hang in MPI_FINALIZE is related to psm rather than to SLURM. Our QLogic cards are old enough that we had to build the psm library ourselves. Can you think of a reason for hanging in MPI_FINALIZE? Could it be that in this case we are only using intranode (shm) communications?
I have discovered that srun and SLURM are not the problem. The problem occurs with the psm libfabric provider that we use on one of our Linux clusters because it has QLogic InfiniBand cards. So basically we are using an old fabric with old cards, and maybe this is just a consequence of that. However, if you can think of a way to force the code to exit MPI_FINALIZE, or of some way to compile and link that would solve the problem, I would appreciate it.
Sorry for the delay in responding. Could you please provide the model name and any additional information regarding your QLogic InfiniBand adapter?
CentOS 7 Linux using the latest oneAPI Fortran compiler and MPI:
CA type: InfiniPath_QLE7340
Number of ports: 1
Hardware version: 2
Node GUID: 0x00117500006fcc26
System image GUID: 0x00117500006fcc26
Physical state: LinkUp
Base lid: 2
SM lid: 1
Capability mask: 0x07690868
Port GUID: 0x00117500006fcc26
Link layer: InfiniBand
Thanks for being patient, we are sorry for the delay.
I am escalating this thread to an SME (Subject Matter Expert).
We will get back to you soon.
You mentioned that you have confirmed this is related to the QLogic hardware. Can you specify another device you tested on which it works?
Please check whether you get the same hang using -trace instead of -check_mpi.
Do you see the same hang with a simple Hello World code built with -check_mpi on the QLogic hardware?
We have two Linux clusters, both configured more or less the same, except that one uses QLogic/psm (qib0) and the other Mellanox/ofi (mlx4_0). The hang in MPI_FINALIZE occurs on the QLogic system. It occurs when I use -check_mpi; it does not occur when I use -trace. I cannot reproduce the problem with a simple Hello World case.
Is there a way to get information from MPI_FINALIZE that might hint at something I am doing that is not appropriate? I do not get any errors or warnings from the -check_mpi option. The calculations finish fine, but the processes are never released and remain running.
For most errors, the message checker will print output immediately. If you have requests left open, those are printed at the end.
Can you attach a debugger and identify where the hang occurs?
Also, have you encountered this in an earlier version?
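To locate the hang without disturbing the job, one common approach is to attach to the stuck process from another terminal. This is a sketch; it assumes gdb (and its gstack helper script) is installed, and `<pid>` stands for the PID of the hung rank, found with e.g. `ps` or `pgrep -f <executable>`:

```shell
# One-shot backtrace of every thread in the hung rank:
gstack <pid>

# Or via gdb in batch mode, which prints a backtrace and detaches
# without killing the process:
gdb -batch -p <pid> -ex bt
```

The frames at the top of the backtrace should show which internal routine MPI_FINALIZE is blocked in.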
The code enters MPI_FINALIZE and never returns, even with only a single MPI process running. This happens only when I use -check_mpi; if I do not use -check_mpi, everything works properly. But the point of using -check_mpi is to see whether there is a problem with my MPI calls. I haven't encountered this before because I have only now started using the -check_mpi option. In general, this option has identified a few non-kosher MPI calls, which I have fixed. I want to use -check_mpi as part of our routine continuous integration process, but I cannot, because the jobs hang in the MPI_FINALIZE call.
So my question to you is this --- is there a time-out parameter that would force the code to exit MPI_FINALIZE and tell me if I have done something non-kosher within the code?
My calculations remain deadlocked in MPI_FINALIZE indefinitely. The job never ends because it is stuck at the second-to-last line of the code, the MPI_FINALIZE call.
If the code never ends, the cluster cores are never released, and I cannot run a suite of test cases automatically.
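As a stopgap for the CI problem, one workaround is to wrap the run in a watchdog so that a hung job still releases its cores. This is a sketch only; it assumes coreutils `timeout` is available and reuses the same placeholders as the srun line above:

```shell
# Kill the run if it has not finished within a generous wall-clock limit,
# so the nodes are freed even if MPI_FINALIZE hangs:
timeout 10m srun -N 1 -n 1 --ntasks-per-node 1 <executable> <input_file.txt>
# An exit status of 124 means the watchdog fired rather than the job finishing.
```

This does not diagnose anything, but it keeps a hang in one test case from blocking the rest of the suite.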
The MPI standard only requires that rank 0 return from MPI_FINALIZE. From version 3.0 of the standard in Chapter 8, section 8.7:
Although it is not required that all processes return from MPI_FINALIZE, it is required that at least process 0 in MPI_COMM_WORLD return, so that users can know that the MPI portion of the computation is over. In addition, in a POSIX environment, users may desire to supply an exit code for each process that returns from MPI_FINALIZE.
So this is what I need to do -- exit MPI_FINALIZE with some sort of error code.
I have attached the output of the gstack command. The VT_VERBOSE output is extensive, but the bottom line appears to be that I have not freed a datatype. I checked, and the only MPI datatype that I create, I free with MPI_TYPE_FREE.
[1 Wed Feb 24 14:52:42 2021] WARNING: LOCAL:DATATYPE:NOT_FREED: warning
[1 Wed Feb 24 14:52:42 2021] WARNING: When calling MPI_Finalize() there were unfreed user-defined datatypes:
[1 Wed Feb 24 14:52:42 2021] WARNING: 1 in this process.
[1 Wed Feb 24 14:52:42 2021] WARNING: This may indicate that resources are leaked at runtime.
[1 Wed Feb 24 14:52:42 2021] WARNING: To clean up properly MPI_Type_free() should be called for
[1 Wed Feb 24 14:52:42 2021] WARNING: all user-defined datatypes.
[1 Wed Feb 24 14:52:42 2021] WARNING: 1. 1 time:
[1 Wed Feb 24 14:52:42 2021] WARNING: mpi_type_create_struct_(count=2, *array_of_blocklens=0x7ffebc055cc0, *array_of_displacements=0x7ffebc055c90, *array_of_types=0x7ffebc055cb0, *newtype=0x7ffebc055b28, *ierr=0xda99900)
[1 Wed Feb 24 14:52:42 2021] WARNING: fds_IP_exchange_diagnostics_ (/home4/mcgratta/firemodels/fds/Build/impi_intel_linux_64_db/../../Source/main.f90:3510)
[1 Wed Feb 24 14:52:42 2021] WARNING: MAIN__ (/home4/mcgratta/firemodels/fds/Build/impi_intel_linux_64_db/../../Source/main.f90:922)
[1 Wed Feb 24 14:52:42 2021] WARNING: main (/home4/mcgratta/firemodels/fds/Build/impi_intel_linux_64_db/fds_impi_intel_linux_64_db)
[1 Wed Feb 24 14:52:42 2021] WARNING: __libc_start_main (/usr/lib64/libc-2.17.so)
[1 Wed Feb 24 14:52:42 2021] WARNING: (/home4/mcgratta/firemodels/fds/Build/impi_intel_linux_64_db/fds_impi_intel_linux_64_db)
[0 Wed Feb 24 14:52:43 2021] INFO: "logging": internal info...
[0 Wed Feb 24 14:52:43 2021] INFO: "logging": communicators...
[1 Wed Feb 24 14:52:43 2021] INFO: "logging": internal info...
[1 Wed Feb 24 14:52:43 2021] INFO: "logging": communicators...
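For reference, the pattern the checker is asking for looks like this in Fortran. This is a sketch only; the variable names are hypothetical and not taken from the actual main.f90:

```fortran
! Sketch; names are hypothetical, not from the real source.
INTEGER :: TYPE_DIAG, IERR
INTEGER :: BLOCKLENS(2), TYPES(2)
INTEGER(KIND=MPI_ADDRESS_KIND) :: DISPLS(2)

CALL MPI_TYPE_CREATE_STRUCT(2, BLOCKLENS, DISPLS, TYPES, TYPE_DIAG, IERR)
CALL MPI_TYPE_COMMIT(TYPE_DIAG, IERR)
! ... communication using TYPE_DIAG ...
CALL MPI_TYPE_FREE(TYPE_DIAG, IERR)   ! must pair with every create
CALL MPI_FINALIZE(IERR)
```

One thing worth checking: if the type is created inside a routine that is called more than once, each creation needs its own matching MPI_TYPE_FREE; a single free at the end would still leave earlier instances unfreed, which would match the checker's count of one leaked type per process.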