Dear Community,
I am using three programs that communicate via MPI:
- software1 is a coupling program. It does some mapping and data exchange between clients (which are software2 and software3).
- software2 is a fluid solver.
- software3 is a structure solver.
These three programs are each started with their own mpirun command:
mpirun -np 1 software1
mpirun -np 1 software2
mpirun -np 1 software3
The three programs worked well with Intel MPI and DAPL up to version 2018. When I try to use Intel MPI 2019 or later, the coupling program software1 hangs.
To debug the problem, I wrote two short programs, server.cpp and client.cpp, that mimic what software1 and software2 do at startup (see the attached files).
mpiicpc -debug client.cpp -o client
mpiicpc -debug server.cpp -o server
mpirun -check_mpi -n 1 server
mpirun -check_mpi -n 1 client
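In essence, server.cpp follows the standard MPI_Open_port / MPI_Comm_accept pattern; the sketch below shows roughly what it does (the exact code is in the attachment, and how the port string is handed over to the client is left out here):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Open a port and print it, as in the "server available at ..." line below.
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port_name);
    std::printf("server available at %s\n", port_name);

    // Wait for the client to connect; the result is an intercommunicator.
    MPI_Comm intercomm;
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);

    // Intel MPI (MPICH-based) uses integer handles, hence the hex values in the outputs.
    std::printf("intercommunicator client=%x\n", (unsigned)intercomm);

    // ... mapping / data exchange with the client would happen here ...

    MPI_Comm_disconnect(&intercomm);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}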
I see two problems:
- With the latest Intel oneAPI, the intercommunicator handles on the client and the server are identical (84000007). If I run the same programs with Intel MPI 2018, the intercommunicator handle on the client (84000001) differs from the one on the server (84000000). What am I doing wrong? Is it a bug in my programs? (A handle sanity check is included in the client sketch further below.)
Output from the server with the latest Intel oneAPI (I_MPI_FABRICS=shm:ofi, on CentOS 7 with a Mellanox card and OFED driver):
intra-communicator: MPI_COMM_WORLD---44000000
server available at tag#0$connentry#0200962CC0A864FE0000000000000000$
intercommunicator client=84000007
Output from the client with the latest Intel oneAPI (I_MPI_FABRICS=shm:ofi, on CentOS 7 with a Mellanox card and OFED driver):
intra-communicator: MPI_COMM_WORLD---44000000
port_name_s=tag#0$connentry#0200962CC0A864FE0000000000000000$
intercommunicator server=84000007
Output from the server with Intel MPI 2018.2 (I_MPI_FABRICS=shm:ofa, on Red Hat Enterprise Linux Server 7.9 (Maipo) with a Mellanox card and OFED driver):
intra-communicator: MPI_COMM_WORLD---44000000
server available at tag#0$OFA#00000007:00000e9d:00000e9e$rdma_port#1024$rdma_host#10:0:7:0:0:14:155:254:128:0:0:0:0:0:0$
intercommunicator client=84000000
Output from the client with Intel MPI 2018.2 (I_MPI_FABRICS=shm:ofa, on Red Hat Enterprise Linux Server 7.9 (Maipo) with a Mellanox card and OFED driver):
intra-communicator: MPI_COMM_WORLD---44000000
port_name_s=tag#0$OFA#00000007:00000e9d:00000e9e$rdma_port#1024$rdma_host#10:0:7:0:0:14:155:254:128:0:0:0:0:0:0$
intercommunicator server=84000001
- With the latest Intel oneAPI (I_MPI_FABRICS=shm:ofi), but also with the older 2018 version, "mpirun -check_mpi" fails with an error on the communicator (for example in client.cpp at MPI_Comm_disconnect). I do not understand the problem here:
intra-communicator: MPI_COMM_WORLD---44000000
port_name_s=tag#0$connentry#0200962CC0A864FE0000000000000000$
intercommunicator server=84000007
[0] ERROR: LOCAL:MPI:CALL_FAILED: error
[0] ERROR: Invalid communicator.
[0] ERROR: Error occurred at:
[0] ERROR: MPI_Comm_disconnect(*comm=0x7ffd90761944->0xffffffff84000007 <<invalid>>)
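For context, client.cpp follows the matching MPI_Comm_connect / MPI_Comm_disconnect pattern, roughly as sketched below. The exact code is in the attachment; how the port string reaches the client is an assumption here, and the MPI_Comm_test_inter / MPI_Comm_remote_size sanity check is my own addition (not in the attached file), related to the first question about the handle values.

#include <mpi.h>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Assumption: the port string printed by the server is passed on the command
    // line; the real client.cpp may obtain it differently (e.g. from a file).
    char port_name[MPI_MAX_PORT_NAME] = {0};
    if (argc > 1)
        std::strncpy(port_name, argv[1], MPI_MAX_PORT_NAME - 1);
    std::printf("port_name_s=%s\n", port_name);

    // Connect to the server; the result is an intercommunicator.
    MPI_Comm intercomm;
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
    std::printf("intercommunicator server=%x\n", (unsigned)intercomm);

    // Sanity check (my addition): ask MPI whether the handle really is an
    // intercommunicator, instead of comparing its raw value across processes.
    int is_inter = 0, remote_size = 0;
    MPI_Comm_test_inter(intercomm, &is_inter);
    if (is_inter)
        MPI_Comm_remote_size(intercomm, &remote_size);
    std::printf("is_inter=%d remote_size=%d\n", is_inter, remote_size);

    // ... data exchange with the server would happen here ...

    // This is the call that the -check_mpi run flags with "Invalid communicator".
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}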
Thanks in advance
Regards
Guillaume
Hi,
Thank you for posting in Intel Communities.
Thanks for providing all the details.
We were able to reproduce your issue at our end using both Intel MPI 2018 Update 2 and the latest version (2021.6) on a Rocky Linux machine. We are working on it and will get back to you soon.
Thanks & Regards,
Hemanth
Thanks Hemanth,
When I try to debug with Intel Trace Analyzer and Collector I also get an error about the communicators. Is there any standard for what communicator and intercommunicator handle values should look like?
In my case they are negative. Could that be the problem?
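My guess, which may be wrong, is that Intel MPI represents MPI_Comm as a 32-bit integer handle, so a value such as 0x84000007 has the sign bit set and therefore prints as a negative number, for example:

#include <cstdio>

int main()
{
    // Example: the intercommunicator handle 0x84000007 from the -check_mpi output.
    unsigned int raw = 0x84000007u;
    int as_int = static_cast<int>(raw);    // wraps to a negative value on common platforms
    std::printf("%d 0x%x\n", as_int, raw); // prints: -2080374777 0x84000007
    return 0;
}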
Regards
Hi,
We are working on your issue and will get back to you soon.
Thanks & Regards,
Hemanth
Hi,
Could you please provide the following details so that we can investigate your issue further?
- Screenshot of the error using ITAC.
- Steps to reproduce your issue at our end.
Thanks & Regards,
Hemanth
Hi,
I apologize! With the short programs I provided above, no error occurs when using ITAC (the -trace option). I was confused because I had seen an error with ITAC when using my complete programs.
The error occurs only with the "-check_mpi" option.
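To be precise, with the same test binaries as above, the two kinds of runs are:

mpirun -trace -n 1 server        # ITAC tracing: no error reported
mpirun -trace -n 1 client
mpirun -check_mpi -n 1 server    # message checking: reports the "Invalid communicator" error
mpirun -check_mpi -n 1 client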
Regards,
Guillaume
Hi,
Thanks for reporting this issue. We were able to reproduce it and we have informed the development team about it.
Thanks & Regards,
Hemanth