Intel® MPI Library

MPI intercommunicator issue

GDN
Novice

Dear Community,

 

I'm using three programs that communicate with each other via MPI:

- software1 is a coupling program. It does the mapping and data exchange between its clients (software2 and software3).

- software2 is a fluid solver.

- software3 is a structure solver.

 

These three programs are started with mpirun:

mpirun -np 1 software1

mpirun -np 1 software2

mpirun -np 1 software3

 

These three programs worked well with Intel MPI and DAPL up to version 2018. When I try to use Intel MPI 2019 or later, the coupling program software1 hangs.

 

To debug the problem, I wrote two short programs, server.cpp and client.cpp, that mimic what software1 and software2 do at startup (see the attached files; a rough sketch of the pattern follows the build and run commands below).

 

mpiicpc -debug client.cpp -o client

mpiicpc -debug server.cpp -o server

mpirun -check_mpi -n 1 server

mpirun -check_mpi -n 1 client
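
For reference, here is a minimal, self-contained sketch of the accept/connect pattern the two programs follow. It merges both roles into a single source file and assumes the port name printed by the server is copied to the client by hand on the command line; the attached server.cpp and client.cpp may differ in detail.

// sketch.cpp -- minimal sketch of the accept/connect pattern (not the
// attached sources): one binary that plays either the server or the client
// role, with the port name passed to the client on the command line.
//
// Build:  mpiicpc -debug sketch.cpp -o sketch
// Run:    mpirun -n 1 ./sketch server            (prints the port name)
//         mpirun -n 1 ./sketch client "<port>"   (connects to that port)
#include <mpi.h>
#include <cstdio>
#include <cstring>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    // MPI_Comm is a plain int handle in Intel MPI, so this hex print works
    // there, but the value itself is implementation-defined.
    printf("intra-communicator: MPI_COMM_WORLD---%x\n", (unsigned)MPI_COMM_WORLD);

    MPI_Comm intercomm = MPI_COMM_NULL;

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        char port_name[MPI_MAX_PORT_NAME];
        MPI_Open_port(MPI_INFO_NULL, port_name);          // obtain a port name
        printf("server available at %s\n", port_name);
        // Blocks until a client connects; yields an intercommunicator.
        MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
        printf("intercommunicator client=%x\n", (unsigned)intercomm);
        MPI_Close_port(port_name);
    } else if (argc > 2 && strcmp(argv[1], "client") == 0) {
        const char* port_name_s = argv[2];
        printf("port_name_s=%s\n", port_name_s);
        // Connects to the server's port; yields an intercommunicator.
        MPI_Comm_connect(port_name_s, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
        printf("intercommunicator server=%x\n", (unsigned)intercomm);
    }

    if (intercomm != MPI_COMM_NULL)
        MPI_Comm_disconnect(&intercomm);   // both sides disconnect when done

    MPI_Finalize();
    return 0;
}

Run the server first, copy the printed port string, then start the client with it as its argument.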

 

 

I run into two problems:

  • With the latest Intel oneAPI, the intercommunicator handles printed by the client and the server are identical (84000007). If I run the same programs with Intel MPI 2018, the intercommunicator handle on the client side (84000001) differs from the one on the server side (84000000). What am I doing wrong? Is it a bug in my programs?

Output from the server with the latest Intel oneAPI (I_MPI_FABRICS=shm:ofi on CentOS 7 with a Mellanox card and OFED driver):

 

intra-communicator: MPI_COMM_WORLD---44000000
server available at tag#0$connentry#0200962CC0A864FE0000000000000000$
intercommunicator client=84000007

 

 

Output from the client with the latest Intel oneAPI (I_MPI_FABRICS=shm:ofi on CentOS 7 with a Mellanox card and OFED driver):

 

intra-communicator: MPI_COMM_WORLD---44000000
port_name_s=tag#0$connentry#0200962CC0A864FE0000000000000000$
intercommunicator server=84000007

 

 

Output from the server with Intel MPI 2018.2 (I_MPI_FABRICS=shm:ofa on Red Hat Enterprise Linux Server release 7.9 (Maipo) with a Mellanox card and OFED driver):

intra-communicator: MPI_COMM_WORLD---44000000
server available at tag#0$OFA#00000007:00000e9d:00000e9e$rdma_port#1024$rdma_host#10:0:7:0:0:14:155:254:128:0:0:0:0:0:0$
intercommunicator client=84000000

 

Output from the client with Intel MPI 2018.2 (I_MPI_FABRICS=shm:ofa on Red Hat Enterprise Linux Server release 7.9 (Maipo) with a Mellanox card and OFED driver):

intra-communicator: MPI_COMM_WORLD---44000000
port_name_s=tag#0$OFA#00000007:00000e9d:00000e9e$rdma_port#1024$rdma_host#10:0:7:0:0:14:155:254:128:0:0:0:0:0:0$
intercommunicator server=84000001

 

  • With the latest Intel oneAPI (I_MPI_FABRICS=shm:ofi), but also with the older 2018 version, "mpirun -check_mpi" fails with an error on the communicator (for example in client.cpp at MPI_Comm_disconnect). I do not understand the problem here (a small error-reporting sketch follows the output below):

 

intra-communicator: MPI_COMM_WORLD---44000000
port_name_s=tag#0$connentry#0200962CC0A864FE0000000000000000$
intercommunicator server=84000007

[0] ERROR: LOCAL:MPI:CALL_FAILED: error
[0] ERROR:    Invalid communicator.
[0] ERROR:    Error occurred at:
[0] ERROR:       MPI_Comm_disconnect(*comm=0x7ffd90761944->0xffffffff84000007 <<invalid>>)
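
For completeness, the failing call can also be made to return an error code instead of aborting by switching the error handler to MPI_ERRORS_RETURN. Below is a small, hypothetical diagnostic sketch; the report() helper and the intercomm variable are mine, not from the attached client.cpp.

#include <mpi.h>
#include <cstdio>

// Print a readable message if an MPI call returned an error code.
static void report(const char* where, int rc) {
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "%s failed: %s\n", where, msg);
    }
}

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    // The default handler aborts on error; with MPI_ERRORS_RETURN the failing
    // call returns an error code that can be inspected and translated.
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Comm intercomm = MPI_COMM_NULL;
    // ... MPI_Comm_connect(...) as in client.cpp would set `intercomm` here ...

    if (intercomm != MPI_COMM_NULL) {
        // Disconnect uses the error handler attached to the intercommunicator.
        MPI_Comm_set_errhandler(intercomm, MPI_ERRORS_RETURN);
        report("MPI_Comm_disconnect", MPI_Comm_disconnect(&intercomm));
    }

    MPI_Finalize();
    return 0;
}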

 

 

Thx in advance

Regards

Guillaume

 

 

 

 

HemanthCH_Intel
Moderator

Hi,


Thank you for posting in Intel Communities.


Thanks for providing all the details.


We are able to reproduce your issue on our end using both Intel MPI 2018 Update 2 and the latest version (2021.6) on a Rocky Linux machine. We are working on your issue and will get back to you soon.


Thanks & Regards,

Hemanth


GDN
Novice

Thx Hemanth,

 

When I try to debug with Intel Trace Analyzer and Collector, I also get an error about the communicators. Is there any standard for communicator and intercommunicator handles?

 

In my case the handle values are negative. Could that be the problem?
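
To make the question concrete: my understanding is that handle values such as 0xffffffff84000007 are opaque, implementation-defined integers, so here is a sketch of how I would inspect the communicators through the standard query calls rather than their raw handle values (the describe() helper is just illustrative):

#include <mpi.h>
#include <cstdio>

// Describe a communicator via portable queries instead of its raw handle.
static void describe(MPI_Comm comm) {
    if (comm == MPI_COMM_NULL) {
        printf("null communicator\n");
        return;
    }
    int is_inter = 0, local_size = 0;
    MPI_Comm_test_inter(comm, &is_inter);   // 1 if comm is an intercommunicator
    MPI_Comm_size(comm, &local_size);       // size of the local group
    if (is_inter) {
        int remote_size = 0;
        MPI_Comm_remote_size(comm, &remote_size);   // size of the remote group
        printf("intercommunicator: local group=%d, remote group=%d\n",
               local_size, remote_size);
    } else {
        printf("intracommunicator: size=%d\n", local_size);
    }
}

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    describe(MPI_COMM_WORLD);   // prints "intracommunicator: size=1" with -n 1
    // describe(intercomm);     // would be called on the accept/connect result
    MPI_Finalize();
    return 0;
}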

 

Regards

HemanthCH_Intel
Moderator

Hi,


We are working on your issue and will get back to you soon.


Thanks & Regards,

Hemanth


HemanthCH_Intel
Moderator

Hi,


Could you please provide the following details to investigate more on your issue?

  1. Screenshot of the error using ITAC.
  2. Steps to reproduce your issue at our end.


Thanks & Regards,

Hemanth


GDN
Novice

Hi,

 

I apologize! With the short programs I provided above, no error occurs when using ITAC (the -trace option). I was confused because I had seen an error with ITAC when using my complete programs.

 

The error occurs only with the "-check_mpi" option.

 

Regards,

Guillaume

HemanthCH_Intel
Moderator

Hi,


Thanks for reporting this issue. We were able to reproduce it and we have informed the development team about it.


Thanks & Regards,

Hemanth

