I am using the oneAPI "latest" version of Intel MPI with Fortran on a Linux cluster. Things are working fine. However, to check my MPI calls, I added -check_mpi to my link step and ran a simple case. The MPI checking works, but the program hangs in MPI_FINALIZE. If I compile without -check_mpi, it does not hang. With or without -check_mpi, the calculation itself runs fine; it just gets stuck in MPI_FINALIZE with -check_mpi.
I did some searching, and there are numerous posts about calculations getting stuck in MPI_FINALIZE, regardless of -check_mpi. The usual response is to ensure that all communications have completed. In my case, however, that is exactly what I want -check_mpi to tell me. I don't think there are outstanding communications, but who knows. Is there a way I can force my way out of MPI_FINALIZE, or prompt it to give me a coherent error message?
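For reference, the setup is of this sort (a sketch, not the actual build line; the file and program names are hypothetical, and mpiifort is the Intel MPI Fortran wrapper):

    # Link with the Intel MPI correctness-checking library
    mpiifort -check_mpi -o my_app my_app.f90
    mpirun -n 4 ./my_app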
Here is the standard error output with VT_VERBOSE=5 set.
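(VT_VERBOSE was exported in the batch script before the run; a sketch of what that looks like, with a hypothetical application name:)

    # Raise Intel Trace Collector verbosity and capture stderr
    export VT_VERBOSE=5
    mpirun -n 4 ./my_app 2> stderr.log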
Thank you, I have sent this to our development team.
Please try running with FI_PROVIDER=verbs.
When I set FI_PROVIDER to verbs, the job now finishes successfully, but I still get a warning about not freeing user-defined datatypes. Previously I was using shm, since this is a small job that runs on only a single node.
Progress! If you run without specifying the provider at all, what happens?
If I do not specify FI_PROVIDER explicitly in the batch script, the case runs and finishes successfully. We have been explicitly exporting FI_PROVIDER=shm for jobs that run on only one node, and not specifying anything otherwise. We are using psm and libfabric/1.10.1 because this cluster has older QLogic cards in it, and we cannot use the precompiled fabrics that come with oneAPI.
I forget now why we explicitly set FI_PROVIDER=shm for one-node jobs and nothing otherwise. My guess is that we had some trouble getting things to work and just landed on this particular setup. It works, with the one exception of the hangs in MPI_FINALIZE when -check_mpi is used at link time. There is also the warning about an unfreed user-defined MPI datatype, which I have not figured out, since I do explicitly free this datatype each time it is created. Perhaps the fact that I commit it multiple times is the problem.
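For comparison, the intended lifecycle is one commit and one free per created handle; MPI_TYPE_FREE marks the handle for deallocation once any pending operations using it complete. A minimal Fortran sketch of that pattern (the vector type here is hypothetical):

    program type_lifecycle
      use mpi
      implicit none
      integer :: ierr, row_type

      call MPI_INIT(ierr)

      ! Create and commit the derived type once ...
      call MPI_TYPE_VECTOR(5, 1, 10, MPI_DOUBLE_PRECISION, row_type, ierr)
      call MPI_TYPE_COMMIT(row_type, ierr)

      ! ... use row_type in as many sends/receives as needed ...

      ! ... and free it exactly once, before MPI_FINALIZE.
      call MPI_TYPE_FREE(row_type, ierr)

      call MPI_FINALIZE(ierr)
    end program type_lifecycle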
I recommend not setting the provider unless it is explicitly necessary.
I'll check with development regarding the possibility of multiple commits causing the warning.
I remember now why we export FI_PROVIDER=shm when we run MPI jobs on a single node. Normally we use the libfabric psm provider for jobs that run across multiple nodes, and when we do so we use the SLURM SBATCH parameter --exclusive so that only one job uses those nodes. For small jobs that use only a few cores, though, we want to allow multiple jobs to share a node. We've noticed that when we use psm, a file called
psm_shm.0fff0fff-0000-0000-0000-0fff0fff0fff
owned by the current person using the node is placed in /dev/shm, blocking all other potential jobs from using that node. We do not have this problem when we use FI_PROVIDER=shm.
So: shm lets us run multiple jobs on the same node, but it causes MPI_FINALIZE to hang when using -check_mpi. psm does not exhibit that hang, but it also does not allow multiple jobs to run on the same node, or at least we cannot figure out how to make it do so.
I know this is very confusing. It all started with the installation of QLogic cards on this cluster, which led us to use psm, and psm has its quirks.
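For anyone hitting the same thing, a quick way to look for a leftover psm segment on a node is a check along these lines (a sketch; the exact file name will differ per job):

    # List any psm shared-memory segments left in /dev/shm
    ls -l /dev/shm/psm_shm.* 2>/dev/null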
What happens if you use I_MPI_FABRICS=shm instead?
Then it works. But why?
To recap: if I compile my code with -check_mpi and export FI_PROVIDER=shm in my batch script, the job hangs in MPI_FINALIZE. However, if I export I_MPI_FABRICS=shm instead, the job does not hang.
So this solves the problem, but dare I ask why? What is the difference between FI_PROVIDER and I_MPI_FABRICS?
Short version: I_MPI_FABRICS=shm will use the Intel® MPI Library shared memory implementation; FI_PROVIDER=shm will use the libfabric shared memory implementation.
I_MPI_FABRICS sets the communication provider used by Intel® MPI Library. In older versions, this was the primary mechanism for specifying the interconnect. Starting with the 2019 release, this was modified, along with other major internal changes, to run all inter-node communications through libfabric. There are now three options for I_MPI_FABRICS: shm (shared memory only, valid only for a single-node run), ofi (libfabric only), and shm:ofi (shared memory intranode, libfabric internode).
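In batch-script terms, the three choices look like this (a sketch; pick one per job):

    export I_MPI_FABRICS=shm       # shared memory only; single-node jobs
    export I_MPI_FABRICS=ofi       # libfabric for all communication
    export I_MPI_FABRICS=shm:ofi   # shared memory intranode, libfabric internode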
FI_PROVIDER sets the provider to be used by libfabric. By choosing shm here, we will still go through libfabric, and libfabric will use its own shared memory communications. See https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/running-applications/fabrics-control/ofi-providers-support.html for our documentation regarding provider selection and https://github.com/ofiwg/libfabric for full details on libfabric.
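So the two settings that came up in this thread select two different shared-memory paths; roughly (a sketch, assuming a single-node job):

    # Intel MPI's own shared-memory implementation, libfabric not involved:
    export I_MPI_FABRICS=shm

    # libfabric's shared-memory provider; traffic still goes through libfabric:
    export FI_PROVIDER=shm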
Intel support will no longer be monitoring this thread. Any further posts are considered community only. For additional assistance related to this issue, please start a new thread.