We have a new cluster with a Mellanox FDR InfiniBand interconnect and sometimes get the following error when running Intel MPI:
[15] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
[16] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
[16] Abort: Got FATAL event 3 at line 1010 in file ../../ofa_utility.c
[159] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
[0] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
We have also seen this error when running over a very large node set:
send desc error[400] Abort: Got completion with error 9, vendor code=8a, dest rank=at line 870 in file ../../ofa_poll.c
I am not seeing this type of error at all with Open MPI. The cluster is using the community OFED stack (not the one supplied by Mellanox). We are using Torque as our resource manager.
Any help diagnosing this would be appreciated.
Bernie
Hey Bernie,
Thanks for posting.
The errors you're seeing are coming from the OFED software stack. It's very likely you're not using a suitable provider when running your Intel MPI jobs. Can you provide a couple of pieces of information?
It'll be good to know what Intel MPI Library version you're running, as well as your full command line and if you're setting any Intel MPI-specific environment variables. Also, please provide your /etc/dat.conf file. I should be able to tell you which provider you'd need to use based on that.
I look forward to hearing back soon.
Regards,
~Gergana
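For anyone gathering the same information, here is a minimal Python sketch that lists the provider entries in a dat.conf file. It assumes the usual layout of one whitespace-separated entry per line, with the interface adapter name in the first field and the quoted device/port parameters in the seventh. The function name `list_dapl_providers` and the sample entry are illustrative only; they are not taken from Bernie's system.

```python
import shlex


def list_dapl_providers(text):
    """Return (provider_name, device_params) pairs from dat.conf-style text."""
    providers = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # shlex keeps quoted fields such as "mlx4_0 1" together.
        fields = shlex.split(line)
        if len(fields) >= 7:
            providers.append((fields[0], fields[6]))
    return providers


if __name__ == "__main__":
    # Illustrative entry only (assumed format, not from the original post).
    sample = (
        "# sample dat.conf\n"
        'ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploofa.so.2 '
        'dapl.2.0 "mlx4_0 1" ""\n'
    )
    for name, params in list_dapl_providers(sample):
        print(name, "->", params)
```

On a real node you would pass in the contents of /etc/dat.conf. With the DAPL fabric, a name from the first column is what would normally go into the I_MPI_DAPL_PROVIDER environment variable.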
We are getting the same error while running vasp.5.3.3 on CentOS 6.3 with Intel composer_xe_2013.1.117, Intel MPI 4.1.0, and Mellanox OFED, on nodes connected by 56 Gbps IB (SNB processors).
Error:
send desc error
[8] Abort: [13] Abort: Got completion with error 12, vendor code?, dest rank> at line 870 in file ../../ofa_poll.c
[14] Abort: Got completion with error 12, vendor code?, dest rank> at line 870 in file ../../ofa_poll.c
[9] Abort: Got completion with error 12, vendor code?, dest rank> at line 870 in file ../../ofa_poll.c
[10] Abort: Got completion with error 12, vendor code?, dest rank> at line 870 in file ../../ofa_poll.c
[11] Abort: Got completion with error 12, vendor code?, dest rank> at line 870 in file ../../ofa_poll.c
[12] Abort: Got completion with error 12, vendor code?, dest rank> at line 870 in file ../../ofa_poll.c
Got completion with error 12, vendor code?, dest rank> at line 870 in file ../../ofa_poll.c
send desc error
[30] Abort: Got completion with error 12, vendor code?, dest rank> at line 870 in file ../../ofa_poll.c
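When many ranks abort at once, it can help to tally which ranks report which completion error. Below is a rough Python sketch that does this for the log format shown in this thread. It assumes that the number after "error" is a libibverbs work-completion status (`enum ibv_wc_status`); the thread does not confirm that, so treat the names as a hint only. Under that assumption, 12 would be IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded) and 9 would be IBV_WC_REM_INV_REQ_ERR. The function name `summarize` is made up for the example.

```python
import re
from collections import Counter

# Subset of enum ibv_wc_status from libibverbs.
# Assumption: the code printed by Intel MPI is this status value.
WC_STATUS = {
    5: "IBV_WC_WR_FLUSH_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

PATTERN = re.compile(r"\[(\d+)\] Abort: Got completion with error (\d+)")


def summarize(log_text):
    """Count completion errors per code and collect the ranks reporting them."""
    counts = Counter()
    ranks = {}
    for match in PATTERN.finditer(log_text):
        rank, code = int(match.group(1)), int(match.group(2))
        counts[code] += 1
        ranks.setdefault(code, set()).add(rank)
    return counts, ranks


if __name__ == "__main__":
    # Two lines copied from the posts in this thread.
    sample = (
        "send desc error[400] Abort: Got completion with error 9, "
        "vendor code=8a, dest rank=at line 870 in file ../../ofa_poll.c\n"
        "[14] Abort: Got completion with error 12, vendor code?, "
        "dest rank> at line 870 in file ../../ofa_poll.c\n"
    )
    counts, ranks = summarize(sample)
    for code, n in sorted(counts.items()):
        name = WC_STATUS.get(code, "unknown")
        print(f"error {code} ({name}): {n} hit(s), ranks {sorted(ranks[code])}")
```

If the failing ranks cluster on a few nodes, those nodes and their links are a reasonable place to start looking.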
We were eventually able to track this down to bad cables that caused the IB links to fail. So it had nothing to do with the software or the OFED stack at all.
Bernie
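For others chasing a similar cabling problem, one place to look is the per-port error counters that the Linux InfiniBand drivers expose under /sys/class/infiniband. The sketch below is a minimal Python example, not the procedure used on this cluster. The choice of counters to watch and the function names are assumptions for illustration, and nonzero values are only a hint that a link deserves a closer look.

```python
from pathlib import Path

# Counters that tend to grow on a marginal link (illustrative selection).
WATCHED = ("symbol_error", "link_error_recovery", "link_downed", "port_rcv_errors")


def read_port_counters(root="/sys/class/infiniband"):
    """Return {(device, port): {counter: value}} for the watched counters."""
    results = {}
    for counters_dir in sorted(Path(root).glob("*/ports/*/counters")):
        device = counters_dir.parts[-4]
        port = counters_dir.parts[-2]
        values = {}
        for name in WATCHED:
            try:
                values[name] = int((counters_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this device; skip it.
                continue
        results[(device, port)] = values
    return results


def suspect_ports(counters):
    """Keep only the ports where at least one watched counter is nonzero."""
    return {
        key: values
        for key, values in counters.items()
        if any(v > 0 for v in values.values())
    }


if __name__ == "__main__":
    for (device, port), values in suspect_ports(read_port_counters()).items():
        print(f"{device} port {port}: {values}")
```

Running this on each node (for example from a Torque job) and comparing the output gives a rough list of ports to inspect. The counters are cumulative, so a second run after some traffic shows whether they are still increasing.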