Intel MPI intermittent failures

Bryan_C_1 · 09-16-2016

I have a couple of users who are experiencing intermittent failures with Intel MPI (typically versions 4.1.0 or 4.1.3) on RedHat 6.4 and 6.7 systems running Mellanox OFED 2.0 and 3.1, respectively.

The error messages being seen are as follows:

[80:node1] unexpected reject event from 16:node2
Assertion failed in file ../../dapl_conn_rc.c at line 992: 0

or

[0:node1] unexpected DAPL event 0x4003

Assertion failed in file ../../dapl_init_rc.c at line 1332: 0

These errors are happening extremely intermittently on both systems. I believe that the jobs are relying on default values for I_MPI_FABRICS (shm:dapl) and I_MPI_DAPL_PROVIDER (should be ofa-v2-mlx4_0-1 on both systems).
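One way to rule out a changed default is to set both variables explicitly in the job environment rather than relying on them. A minimal Python sketch, assuming the values quoted above are the intended ones; the helper name `pinned_mpi_env` and the `./app` launch line in the comment are hypothetical:

```python
import os


def pinned_mpi_env(base_env=None):
    """Return a copy of the environment with the fabric and DAPL provider set explicitly."""
    env = dict(os.environ if base_env is None else base_env)
    # Values taken from the post above; adjust if your site uses a different provider.
    env["I_MPI_FABRICS"] = "shm:dapl"
    env["I_MPI_DAPL_PROVIDER"] = "ofa-v2-mlx4_0-1"
    return env


env = pinned_mpi_env()
print(env["I_MPI_FABRICS"], env["I_MPI_DAPL_PROVIDER"])

# Hypothetical launch (not executed here):
#   subprocess.run(["mpirun", "-n", "4", "./app"], env=env, check=True)
```

If the failures persist with both values pinned, the defaults themselves are unlikely to be the cause.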

It seems like these are DAPL layer errors. Any ideas on what might cause these sorts of intermittent failures?

Thanks!

Michael_Intel · 09-26-2016

Hi Bryan,

In general, it is worth mentioning that your Intel MPI installation is quite outdated, and you should try running a more recent version.

Also, as you already recognized, it seems that your issues are coming from the DAPL layer, therefore you might want to try out a newer version of the DAPL library which you may find on the OpenFabrics site (http://downloads.openfabrics.org/downloads/dapl/).

Best regards,

Michael

Bryan_C_1 · 09-26-2016

Michael,

Thanks. We have a newer Intel MPI version (5.0.2) available, but these applications haven't migrated to it as of yet. We are currently testing against it to see if the failures continue.

I will look into possibly upgrading DAPL as well, but that might be a more complicated issue.

Thanks,
Bryan

Michael_Intel · 09-27-2016

Hi Bryan,

One further note: since the crashes you observe are happening in the DAPL RC protocol, you might want to try DAPL UD as a potential workaround (I_MPI_DAPL_UD=1).

Best regards,

Michael

Bryan_C_1 · 09-28-2016

Michael,

We added I_MPI_DAPL_UD=1 to our job that has been simulating/reproducing the failures. It has been running under IMPI 4.1.3 on various RedHat versions. Unfortunately, that did not resolve the issue, as failures were still seen under 4.1.3 with DAPL UD enabled.

5.0.2 on RedHat 6.7 appears to be stable (without I_MPI_DAPL_UD=1), but 5.0.2 on the older RedHat 6.4 system is not. I think this is related to the DAPL versions available on each system.

The error seen this time on 4.1.3 with I_MPI_DAPL_UD=1 was:

[84:node1] unexpected ep_handle=0x59243d0 in DAPL conn event 0x4003

Thanks for any additional insight you could provide.

- Bryan

Michael_Intel · 09-28-2016

Hi Bryan,

Thanks for the feedback. Could you please give a newer DAPL library version a try?

You don't have to install it system wide, just install it in your home directory and prepend the location to your LD_LIBRARY_PATH.
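As a rough illustration of the prepend step, here is a minimal Python sketch. The install location `~/dapl/lib` and the helper name `prepend_library_path` are assumptions made for the example, not paths from an actual DAPL build:

```python
import os


def prepend_library_path(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Put the private directory first so the dynamic loader searches it
    # before the system-wide library directories.
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


# Hypothetical private install location in the home directory.
private_lib = os.path.expanduser("~/dapl/lib")
env = prepend_library_path(private_lib)
print(env["LD_LIBRARY_PATH"])
```

The returned environment would then be passed to the job launch, so only that job picks up the newer library while the system installation stays untouched.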

Best regards,

Michael

Bryan_C_1 · 09-28-2016

Michael,

I'll give that a try. Let me see what I can do.

- Bryan

Bryan_C_1 · 10-04-2016

Michael,

I've not been able to test against the updated DAPL libraries as of yet, but we think we may have found the issue.

It turns out that we increased the value of kernel.pid_max on the systems from 32768 to 65535. This seemed like a benign change at the time. However, it looks like that is actually causing the intermittent failures. Is it possible that IMPI or the DAPL libraries are using a 16-bit integer (short) as a PID value? This would explain the intermittent nature of it.
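To make the hypothesis concrete, the sketch below shows what happens to a PID when it is stored in a 16-bit field. It only illustrates the arithmetic and makes no claim about where, or whether, such a field exists in Intel MPI, DAPL, or OFED; the helper names are made up for the example:

```python
import ctypes


def as_signed_short(pid):
    """Value seen after storing pid in a signed 16-bit integer (C short)."""
    return ctypes.c_int16(pid).value


def as_unsigned_short(pid):
    """Value seen after storing pid in an unsigned 16-bit integer."""
    return ctypes.c_uint16(pid).value


# PIDs up to 32767 survive both types. Larger ones wrap to a negative
# number in a signed short but still fit in an unsigned 16-bit field.
for pid in (1234, 32767, 32768, 40000, 65534):
    print(pid, as_signed_short(pid), as_unsigned_short(pid))
```

Only processes that happen to be assigned a PID above 32767 would be affected by a signed 16-bit field, which would fit failures that appear only some of the time.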

- Bryan

Michael_Intel · 10-05-2016

Hi Bryan,

The Intel MPI library uses signed and unsigned integers (both 32-bit) to handle process IDs.
Regarding the DAPL library: I just checked with one of our developers, and it seems that unsigned values of at least 16 bits are used there as well.

Best regards,
Michael

Bryan_C_1 · 10-05-2016

Hmmm.. Guess it might be at another level. Maybe the OFED libraries involved. I think we've solved the issue, just not sure where it actually lies.

Thanks for checking and the support!

- Bryan