YaoYe
Novice

MPI_Send(177) fail

When using mpirun, I get the following error. Need your help.

Fatal error in PMPI_Send

PMPI_Send(177) : MPI_Send(buf=xxxxxx, count=1,MPI_INT,dest=0,tag=1,MPI_COMM_WORLD) failed

MPID_Send(256)

MPIDI_OFI_send_lightweight(52)

MPIDI_OFI_send_handler(704): OFI tagged inject failed (ofi_impl.h:704:MPIDI_OFI_send_handler: No such file or directory)

PrasanthD_intel
Moderator

Hi,


From the error you have posted, we are not sure where MPI_Send() failed.

Can you share the source code or any reproducible code with us so that we can debug the code from our side?

If sharing the code isn't possible, you can check the correctness of your code using ITAC (Intel Trace Analyzer and Collector).

source the itacvars.sh - source <install_dir>/2019.x.xx/bin/itacvars.sh

and then run mpi with -check_mpi flag

mpirun -np <> -check_mpi ./program

For more info please check: https://software.intel.com/content/www/us/en/develop/documentation/itc-user-and-reference-guide/top/...


Post the logs after running with ITAC and setting I_MPI_DEBUG=10

export I_MPI_DEBUG=10
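Putting these steps together, a complete diagnostic session might look like the sketch below (the install path is the placeholder from above, and the rank count is only an example):

```shell
# Set up the ITAC environment (path layout as in the 2019 install mentioned above)
source <install_dir>/2019.x.xx/bin/itacvars.sh

# Ask Intel MPI for verbose startup/debug output
export I_MPI_DEBUG=10

# Run under ITAC's MPI correctness checker and capture the output
mpirun -np 4 -check_mpi ./program 2>&1 | tee check_mpi.log
```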


Regards

Prasanth


YaoYe
Novice

Hi,

Thanks for your response.

I am using Intel MPI 2019.0.117 with the out-of-box test source code in /opt/intel/impi/2019.0.117/test/test.c.

# pwd

/opt/intel/impi/2019.0.117/test

# mpiicc -cc=gcc test.c -o testc

......

# pwd

/opt/intel/impi/2019.0.117/intel64/bin

# export I_MPI_DEBUG=10

# mpirun -np 80 -ppn 1 -hosts master.localdomain,node.localdomain ../../test/testc

[0] MPI startup():  libfabric version 1.6.1a1-impi

[0] MPI startup():  libfabric version provider: sockets

It waited for a long time with no response at all,

so I pressed Ctrl+C,

then exited and reconnected with ssh root@master.localdomain (no login prompt).

# cd /opt/intel/impi/2019.0.117/intel64/bin

# mpirun -np 80 -ppn 1 -hosts master.localdomain,node.localdomain ../../test/testc

helloworld: rank 0 of 80 running on master.localdomain

helloworld: rank 1 of 80 running on node.localdomain

.....

helloworld: rank 79 of 80 running on node.localdomain

It works.

Sometimes I need to exit, ssh in again, and then run it; then it is OK.

Maybe it is an environment issue. I need to track it down.
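Since the hang goes away after exiting and re-connecting, it may be worth verifying that passwordless ssh and hostname resolution are consistent between the two nodes; a quick check (hostnames taken from this thread) could be:

```shell
# From master: confirm passwordless ssh works and the peer reports its own name
ssh node.localdomain hostname

# Confirm both hostnames resolve consistently on this node
getent hosts master.localdomain node.localdomain
```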


PrasanthD_intel
Moderator

Hi,

Based on the documentation on Intel MPI errors, we think this error might be due to an interconnect/provider mismatch.

Please refer to this for more details: https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-windows/top/t...

Do both of your nodes, master.localdomain and node.localdomain, have the same type of hardware interconnect?

Could you please provide us with all the available providers on your nodes? You can get them by running fi_info.
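For reference, the provider list can be summarized with fi_info like this (the -l option lists provider names in current libfabric builds; please treat the exact flags as an assumption for your installed version):

```shell
# List the names of the available libfabric providers
fi_info -l

# Or dump the full info and keep only the provider lines
fi_info | grep 'provider:' | sort -u
```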

Also, please share the full I_MPI_DEBUG log with us.


Regards

Prasanth


PrasanthD_intel
Moderator

Hi,


It looks like the problem is with the configuration of the NIC card on the master.localdomain node. Could you please check the configuration and see if everything is all right?

Also, along with the I_MPI_DEBUG logs requested earlier, could you provide the logs after setting FI_LOG_LEVEL=DEBUG?
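A sketch of collecting both logs in a single run (command line reused from earlier in this thread; the log file name is my own choice):

```shell
export I_MPI_DEBUG=10
export FI_LOG_LEVEL=DEBUG

mpirun -np 80 -ppn 1 -hosts master.localdomain,node.localdomain \
    ../../test/testc 2>&1 | tee impi_debug.log
```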


Regards

Prasanth


PrasanthD_intel
Moderator

Hi,


We are closing this thread assuming your problem is resolved.

If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Regards

Prasanth

