Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI_Send(177) failure

YaoYe
Novice
1,351 Views

When using mpirun, I get the following error. I need your help.

Fatal error in PMPI_Send

PMPI_Send(177) : MPI_Send(buf=xxxxxx, count=1,MPI_INT,dest=0,tag=1,MPI_COMM_WORLD) failed

MPID_Send(256)

MPIDI_OFI_send_lightweight(52)

MPIDI_OFI_send_handler(704): OFI tagged inject failed (ofi_impl.h:704:MPIDI_OFI_send_handler: No such file or directory)

5 Replies
PrasanthD_intel
Moderator
1,337 Views

Hi,


From the error you have posted, we are not sure where the failure occurred in MPI_Send().

Can you share the source code or a reproducer with us so that we can debug it from our side?

If sharing code isn't possible, you can check the correctness of the code using ITAC (Intel Trace Analyzer and Collector):

Source itacvars.sh: source <install_dir>/2019.x.xx/bin/itacvars.sh

and then run mpirun with the -check_mpi flag:

mpirun -np <> -check_mpi ./program

For more information, please check: https://software.intel.com/content/www/us/en/develop/documentation/itc-user-and-reference-guide/top/user-guide/correctness-checking/correctness-checking-of-mpi-applications.html


Please post the logs after running with ITAC and setting I_MPI_DEBUG=10:

export I_MPI_DEBUG=10
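
For example, a complete check run could look like this (the install path is the placeholder from above, and the process count, program name, and log file name are only examples to adapt to your setup):

source <install_dir>/2019.x.xx/bin/itacvars.sh
export I_MPI_DEBUG=10
mpirun -np 4 -check_mpi ./program 2>&1 | tee check_mpi.log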


Regards

Prasanth


YaoYe
Novice
1,332 Views

Hi,

Thanks for your response.

I am using Intel MPI 2019.0.117. I used the out-of-the-box test source code that ships with it, at /opt/intel/impi/2019.0.117/test/test.c.

# pwd

/opt/intel/impi/2019.0.117/test

# mpiicc -cc=gcc test.c -o testc

......

# pwd

/opt/intel/impi/2019.0.117/intel64/bin

# export I_MPI_DEBUG=10

# mpirun -np 80 -ppn 1 -hosts master.localdomain,node.localdomain ../../test/testc

[0] MPI startup():  libfabric version: 1.6.1a1-impi

[0] MPI startup():  libfabric provider: sockets

The run waited for a long time with no response,

so I pressed Ctrl+C.

I then exited and ssh'd to root@master.localdomain again (it connected without a login prompt), and:

# cd /opt/intel/impi/2019.0.117/intel64/bin

# mpirun -np 80 -ppn 1 -hosts master.localdomain,node.localdomain ../../test/testc

helloworld: rank 0 of 80 running on master.localdomain

helloworld: rank 1 of 80 running on node.localdomain

.....

helloworld: rank 79 of 80 running on node.localdomain

This time it works.

Sometimes I need to exit and ssh in again, and then it runs fine.

Maybe it is an environment issue; I need to investigate it.
PrasanthD_intel
Moderator
1,302 Views

Hi,

Based on the documentation on Intel MPI errors, we think this error might be due to a mismatch between the interconnect and the libfabric provider.

Please refer to this for more details: https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-windows/top/troubleshooting/error-message-fatal-error.html
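
One way to isolate a provider mismatch (a suggestion from our side, not something confirmed by your logs yet) is to pin libfabric to a specific provider for a test run. FI_PROVIDER is the libfabric environment variable that controls this; for example, to force the plain TCP sockets provider:

export FI_PROVIDER=sockets
mpirun -np 80 -ppn 1 -hosts master.localdomain,node.localdomain ../../test/testc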

Do both your nodes, master.localdomain and node.localdomain, have the same type of hardware interconnect?

Could you please provide us with all the available providers on your nodes? You can get them by running fi_info.

Also, please share the full I_MPI_DEBUG log.
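
For example (these fi_info options are illustrative and may vary slightly between libfabric versions):

fi_info -l
fi_info -p sockets

The first command lists only the names of the available providers; the second prints the detailed fabric information for one provider (sockets is just an example).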


Regards

Prasanth


PrasanthD_intel
Moderator
1,287 Views

Hi,


It looks like the problem is with the configuration of the NIC in the master.localdomain node. Could you please check the configuration and see if everything is alright?

Also, along with the I_MPI_DEBUG logs I asked for previously, could you provide logs after setting FI_LOG_LEVEL=DEBUG?
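
For example, something like this captures both logs in one run (the log file name is only a suggestion; keep your usual mpirun arguments):

export I_MPI_DEBUG=10
export FI_LOG_LEVEL=DEBUG
mpirun -np 80 -ppn 1 -hosts master.localdomain,node.localdomain ../../test/testc > debug_run.log 2>&1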


Regards

Prasanth


PrasanthD_intel
Moderator
1,233 Views

Hi,


We are closing this thread assuming your problem is resolved.

If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Regards

Prasanth

