Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI error

Bernie_B_
Beginner
1,832 Views

We have a new cluster with Mellanox FDR Infiniband interconnect and sometimes get the following error when running Intel MPI :

[15] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
[16] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
[16] Abort: Got FATAL event 3 at line 1010 in file ../../ofa_utility.c
[159] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
[0] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c

We have also seen this error when running over a very large nodeset :

send desc error
[400] Abort: Got completion with error 9, vendor code=8a, dest rank= at line 870 in file ../../ofa_poll.c

I am not seeing this type of error at all using Open MPI. The cluster is using the stock OFED stack (not the Mellanox vendor-supplied one). We are using Torque as our resource manager.

Any help diagnosing this would be appreciated.

Bernie

0 Kudos
5 Replies
Gergana_S_Intel
Employee
1,832 Views

Hey Bernie,

Thanks for posting.

The errors you're seeing are coming from the OFED software stack. It's very likely you're not using a suitable provider when running your Intel MPI jobs. Can you provide a couple of pieces of information?

It would be good to know which Intel MPI Library version you're running, your full command line, and whether you're setting any Intel MPI-specific environment variables. Also, please provide your /etc/dat.conf file; based on that, I should be able to tell you which provider you need to use.
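If it's easier, a quick way to gather most of that in one go is to run a small test job with debug output turned up. This is only a generic sketch (./hello stands in for any MPI binary you have handy, and the exact wording of the startup lines varies a bit between releases):

export I_MPI_FABRICS=shm:ofa
export I_MPI_DEBUG=5
mpirun -np 4 ./hello 2>&1 | grep -i "MPI startup"
cat /etc/dat.conf

With debug output enabled, the startup messages should report the library version and which data transfer modes (shm, ofa, dapl, tcp) each rank actually selected, which will tell us whether the OFA path is really being used.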

I look forward to hearing back soon.

Regards,
~Gergana

0 Kudos
Bernie_B_
Beginner
1,832 Views
Gergana - thanks for the quick reply.

I am running the latest Intel MPI, 4.1.0.027, and I am setting the following environment variables:

I_MPI_FABRICS=shm:ofa
I_MPI_DEBUG=2
I_MPI_ROOT=/fltapps/boeing/mpi/intel/impi/4.1.0.027
I_MPI_EXTRA_FILESYSTEM=1
I_MPI_EXTRA_FILESYSTEM_LIST=panfs

We have a Panasas file system. I was under the impression that this version did not require an /etc/dat.conf and that Intel MPI supported InfiniBand natively, without a DAPL layer.

Here is the command line:

/usr/bin/time mpirun -np $NPROCS $OVEREXE >& over.1.out

$OVEREXE is the program we are running and $NPROCS is the number of processors to use. The job runs under Torque and should pick up the node list from the queuing system.

To clarify, we don't see the error all the time, just on some submissions. So it sounds like we have an OFED problem on one or more of the nodes, since it works most of the time. Any other clues you can give me to diagnose what is wrong would be appreciated.

Bernie
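P.S. For context, a stripped-down version of the kind of Torque script this runs from looks roughly like the sketch below. The node counts, walltime, and executable name are placeholders, not our production settings:

#!/bin/bash
#PBS -l nodes=4:ppn=16
#PBS -l walltime=04:00:00

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

export I_MPI_FABRICS=shm:ofa
export I_MPI_DEBUG=2
export I_MPI_EXTRA_FILESYSTEM=1
export I_MPI_EXTRA_FILESYSTEM_LIST=panfs

# Torque writes the allocated node list to $PBS_NODEFILE, one line per core slot
NPROCS=$(wc -l < $PBS_NODEFILE)

OVEREXE=./your_solver   # placeholder for the actual executable
/usr/bin/time mpirun -np $NPROCS $OVEREXE >& over.1.out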
0 Kudos
John_Gilmore
Beginner
1,832 Views
Hi guys,

I seem to have the same issue. When I run my Intel MPI job (which also runs on both MVAPICH2 and Open MPI), I receive the following error:

[6] Abort: Got FATAL event 3 at line 1010 in file ../../ofa_utility.c
recv desc error, 128, 0x61b880
[1] Abort: Got completion with error 9, vendor code=8a, dest rank= at line 870 in file ../../ofa_poll.c

After this, the application just blocks. I'm running Intel MPI 4.1.0.024. It's an MPMD application, and my command line is:

mpirun -perhost 3 -f /home/john/App/src/hostfile_intel \
    -n 3 -env I_MPI_FABRICS shm:ofa ./AppA : \
    -n 9 -env I_MPI_FABRICS shm:ofa ./AppB

I don't have an /etc/dat.conf file. My OS and architecture:

Linux hostname 3.3.8-1.fc16.x86_64 #1 SMP Mon Jun 4 20:49:02 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Does anyone have any ideas? I haven't installed any OFA libraries explicitly, but I didn't need to for the other MPI implementations either. We're using Mellanox InfiniBand adapters. Any help would be greatly appreciated.

Regards,
John
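P.S. In case it's useful to anyone comparing notes, this is roughly how I sanity-checked that the InfiniBand stack itself is present and active on a node (standard OFED utilities; the exact output wording differs between OFED versions):

# HCA and port state; a healthy port should report PORT_ACTIVE / LinkUp
ibv_devinfo | grep -E "hca_id|state"
ibstat | grep -E "State|Rate"

# check that the verbs/RDMA libraries the ofa fabric needs are installed
rpm -qa | grep -E "libibverbs|librdmacm|libmlx"

If the port shows as active and those libraries are present, the basic stack is in place and the problem is more likely in how the MPI layer is driving it.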
0 Kudos
pankajd
Beginner
1,832 Views

We are getting the same error while running VASP 5.3.3 on CentOS 6.3 with Intel Composer XE 2013.1.117 and Intel MPI 4.1.0, using Mellanox OFED on 56 Gbps InfiniBand-connected nodes (Sandy Bridge processors).

Error:

send desc error
[8] Abort: Got completion with error 12, vendor code=?, dest rank= at line 870 in file ../../ofa_poll.c
[13] Abort: Got completion with error 12, vendor code=?, dest rank= at line 870 in file ../../ofa_poll.c
[14] Abort: Got completion with error 12, vendor code=?, dest rank= at line 870 in file ../../ofa_poll.c
[9] Abort: Got completion with error 12, vendor code=?, dest rank= at line 870 in file ../../ofa_poll.c
[10] Abort: Got completion with error 12, vendor code=?, dest rank= at line 870 in file ../../ofa_poll.c
[11] Abort: Got completion with error 12, vendor code=?, dest rank= at line 870 in file ../../ofa_poll.c
[12] Abort: Got completion with error 12, vendor code=?, dest rank= at line 870 in file ../../ofa_poll.c
send desc error
[30] Abort: Got completion with error 12, vendor code=?, dest rank= at line 870 in file ../../ofa_poll.c

0 Kudos
Bernie_B_
Beginner
1,832 Views

We were eventually able to track this down to bad cables that caused the IB fabric to fail, so it had nothing to do with the software or the OFED stack at all.
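For anyone who runs into the same symptoms: what finally pointed us at the hardware was looking at the port error counters across the fabric rather than at the MPI layer. Roughly speaking, with the standard infiniband-diags tools (a generic sketch, not our exact commands; the LID and port number are placeholders):

# scan the fabric and report ports whose error counters look suspicious
ibqueryerrors

# or query one suspect port directly
perfquery <lid> <port>

A marginal cable typically shows up as SymbolErrorCounter, LinkErrorRecoveryCounter, or LinkDownedCounter values that keep climbing on the ports it connects, which is the kind of signature that points at cabling rather than at the MPI or OFED software.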

Bernie

0 Kudos