<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Michael, in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079350#M4853</link>
    <description>&lt;P&gt;Michael,&lt;/P&gt;

&lt;P&gt;I've not been able to test against the updated DAPL libraries as of yet, but we think we may have found the issue.&lt;/P&gt;

&lt;P&gt;It turns out that we increased the value of kernel.pid_max on the systems from 32768 to 65535.&amp;nbsp; This seemed like a benign change at the time.&amp;nbsp; However, it looks like that is actually causing the intermittent failures.&amp;nbsp; Is it possible that IMPI or the DAPL libraries are using a 16-bit integer (short) as a PID value?&amp;nbsp; This would explain the intermittent nature of it.&lt;/P&gt;

&lt;P&gt;- Bryan&lt;/P&gt;</description>
    <pubDate>Tue, 04 Oct 2016 14:20:44 GMT</pubDate>
    <dc:creator>Bryan_C_1</dc:creator>
    <dc:date>2016-10-04T14:20:44Z</dc:date>
    <item>
      <title>Intel MPI intermittent failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079343#M4846</link>
      <description>&lt;P&gt;I have a couple of users who are experiencing intermittent failures using Intel MPI (typically versions 4.1.0 or 4.1.3) on Red Hat 6.4 and 6.7 systems using Mellanox OFED 2.0 and 3.1, respectively.&lt;/P&gt;

&lt;P&gt;The error messages being seen are as follows:&lt;/P&gt;

&lt;DIV style="margin:0;"&gt;[80:node1] unexpected reject event from 16:node2&lt;BR /&gt;
	Assertion failed in file ../../dapl_conn_rc.c at line 992: 0&lt;/DIV&gt;

&lt;P&gt;or&lt;/P&gt;

&lt;DIV style="margin:0;"&gt;[0:node1] unexpected DAPL event 0x4003&lt;BR /&gt;
	Assertion failed in file ../../dapl_init_rc.c at line 1332: 0&lt;/DIV&gt;

&lt;P&gt;These errors occur very intermittently on both systems. I believe the jobs rely on the default values for I_MPI_FABRICS (shm:dapl) and I_MPI_DAPL_PROVIDER (which should be ofa-v2-mlx4_0-1 on both systems).&lt;/P&gt;

&lt;P&gt;These look like DAPL layer errors. Any ideas on what might cause these sorts of intermittent failures?&lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2016 15:45:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079343#M4846</guid>
      <dc:creator>Bryan_C_1</dc:creator>
      <dc:date>2016-09-16T15:45:08Z</dc:date>
    </item>
    <item>
      <title>Hi Bryan,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079344#M4847</link>
      <description>&lt;P&gt;Hi Bryan,&lt;/P&gt;

&lt;P&gt;In general, it is worth mentioning that your Intel MPI installation is quite outdated, and you should try running a more recent version.&lt;/P&gt;

&lt;P&gt;Also, as you already recognized, your issues seem to come from the DAPL layer, so you might want to try a newer version of the DAPL library, which you can find on the OpenFabrics site (&lt;A href="http://downloads.openfabrics.org/downloads/dapl/"&gt;http://downloads.openfabrics.org/downloads/dapl/&lt;/A&gt;).&lt;/P&gt;

&lt;P&gt;Best regards,&lt;/P&gt;

&lt;P&gt;Michael&lt;/P&gt;</description>
      <pubDate>Mon, 26 Sep 2016 12:28:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079344#M4847</guid>
      <dc:creator>Michael_Intel</dc:creator>
      <dc:date>2016-09-26T12:28:41Z</dc:date>
    </item>
    <item>
      <title>Michael,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079345#M4848</link>
      <description>&lt;P&gt;Michael,&lt;/P&gt;

&lt;P&gt;Thanks. We have a newer Intel MPI version (5.0.2) available, but these applications haven't migrated to it yet. We are currently testing against it to see whether the failures continue.&lt;/P&gt;

&lt;P&gt;I will look into possibly upgrading DAPL as well, but that might be a more complicated issue.&lt;/P&gt;

&lt;P&gt;Thanks,&lt;BR /&gt;
	Bryan&lt;/P&gt;</description>
      <pubDate>Mon, 26 Sep 2016 15:28:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079345#M4848</guid>
      <dc:creator>Bryan_C_1</dc:creator>
      <dc:date>2016-09-26T15:28:18Z</dc:date>
    </item>
    <item>
      <title>Hi Bryan,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079346#M4849</link>
      <description>&lt;P&gt;Hi Bryan,&lt;/P&gt;

&lt;P&gt;One further note: since the crashes you observe are happening in the DAPL RC protocol, you might want to try DAPL UD as a potential workaround (I_MPI_DAPL_UD=1).&lt;/P&gt;
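
&lt;P&gt;As a rough sketch of how that variable could be applied to a single launch without touching the login environment, the Python snippet below builds a copy of the environment with I_MPI_DAPL_UD=1 set. The helper name with_dapl_ud and the commented-out mpirun command line are illustrative assumptions, not part of Intel MPI.&lt;/P&gt;

```python
import os


def with_dapl_ud(base_env=None):
    """Return a copy of the environment with DAPL UD requested.

    with_dapl_ud is a hypothetical helper; only the variable name and
    value (I_MPI_DAPL_UD=1) come from the suggestion above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_DAPL_UD"] = "1"
    return env


if __name__ == "__main__":
    env = with_dapl_ud()
    print("I_MPI_DAPL_UD =", env["I_MPI_DAPL_UD"])
    # Assumed launch line; adjust the launcher, rank count and binary:
    # subprocess.run(["mpirun", "-n", "4", "./my_app"], env=env, check=True)
```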

&lt;P&gt;Best regards,&lt;/P&gt;

&lt;P&gt;Michael&lt;/P&gt;</description>
      <pubDate>Tue, 27 Sep 2016 07:58:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079346#M4849</guid>
      <dc:creator>Michael_Intel</dc:creator>
      <dc:date>2016-09-27T07:58:17Z</dc:date>
    </item>
    <item>
      <title>Michael,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079347#M4850</link>
      <description>&lt;P&gt;Michael,&lt;/P&gt;

&lt;P&gt;We added I_MPI_DAPL_UD=1 to the job we use to reproduce the failures. It has been running under IMPI 4.1.3 on various Red Hat versions. Unfortunately, that did not resolve the issue: failures still occurred under 4.1.3 with DAPL UD enabled.&lt;/P&gt;

&lt;P&gt;5.0.2 on Red Hat 6.7 seems to be stable (without I_MPI_DAPL_UD=1), but 5.0.2 on the older Red Hat 6.4 system is not. I think this is related to the DAPL versions available on each system.&lt;/P&gt;

&lt;P&gt;The error seen this time on 4.1.3 with I_MPI_DAPL_UD=1 was:&lt;/P&gt;

&lt;P&gt;[84:node1] unexpected ep_handle=0x59243d0 in DAPL conn event 0x4003&lt;/P&gt;

&lt;P&gt;Thanks for any additional insight you could provide.&lt;/P&gt;

&lt;P&gt;- Bryan&lt;/P&gt;</description>
      <pubDate>Wed, 28 Sep 2016 15:26:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079347#M4850</guid>
      <dc:creator>Bryan_C_1</dc:creator>
      <dc:date>2016-09-28T15:26:47Z</dc:date>
    </item>
    <item>
      <title>Hi Bryan,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079348#M4851</link>
      <description>&lt;P&gt;Hi Bryan,&lt;/P&gt;

&lt;P&gt;Thanks for the feedback. Could you please give a newer DAPL library version a try?&lt;/P&gt;

&lt;P&gt;You don't have to install it system-wide; just install it in your home directory and prepend its location to your LD_LIBRARY_PATH.&lt;/P&gt;
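
&lt;P&gt;A minimal sketch of the prepend step, written in Python so it can be reused from a job wrapper. The helper name prepend_library_path and the ~/dapl/lib prefix are hypothetical; substitute whatever directory the library was actually installed into.&lt;/P&gt;

```python
import os


def prepend_library_path(directory, base_env=None):
    """Return a copy of the environment with directory first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Skip empty entries: an empty element would make the loader search
    # the current directory.
    parts = [directory] + [p for p in existing.split(":") if p]
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env


if __name__ == "__main__":
    # Hypothetical install prefix under the home directory.
    dapl_lib = os.path.expanduser("~/dapl/lib")
    print(prepend_library_path(dapl_lib)["LD_LIBRARY_PATH"])
```

&lt;P&gt;Because the new directory comes first, the dynamic loader should pick up the library from there before the system-wide copy.&lt;/P&gt;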

&lt;P&gt;Best regards,&lt;/P&gt;

&lt;P&gt;Michael&lt;/P&gt;</description>
      <pubDate>Wed, 28 Sep 2016 15:32:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079348#M4851</guid>
      <dc:creator>Michael_Intel</dc:creator>
      <dc:date>2016-09-28T15:32:52Z</dc:date>
    </item>
    <item>
      <title>Michael,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079349#M4852</link>
      <description>&lt;P&gt;Michael,&lt;/P&gt;

&lt;P&gt;I'll give that a try.&amp;nbsp; Let me see what I can do.&lt;/P&gt;

&lt;P&gt;- Bryan&lt;/P&gt;</description>
      <pubDate>Wed, 28 Sep 2016 15:50:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079349#M4852</guid>
      <dc:creator>Bryan_C_1</dc:creator>
      <dc:date>2016-09-28T15:50:11Z</dc:date>
    </item>
    <item>
      <title>Michael,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079350#M4853</link>
      <description>&lt;P&gt;Michael,&lt;/P&gt;

&lt;P&gt;I've not been able to test against the updated DAPL libraries as of yet, but we think we may have found the issue.&lt;/P&gt;

&lt;P&gt;It turns out that we increased the value of kernel.pid_max on the systems from 32768 to 65535.&amp;nbsp; This seemed like a benign change at the time.&amp;nbsp; However, it looks like that is actually causing the intermittent failures.&amp;nbsp; Is it possible that IMPI or the DAPL libraries are using a 16-bit integer (short) as a PID value?&amp;nbsp; This would explain the intermittent nature of it.&lt;/P&gt;
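
&lt;P&gt;To illustrate the hypothesis (this is not a confirmed diagnosis of IMPI or DAPL): if some layer kept a PID in a signed 16-bit field, any PID above 32767 would read back as a different, negative value. Only processes that happened to receive such a PID would be affected. The helper name as_signed_short is made up for this sketch.&lt;/P&gt;

```python
import ctypes


def as_signed_short(pid):
    """Store pid in a signed 16-bit integer and read it back."""
    return ctypes.c_int16(pid).value


if __name__ == "__main__":
    for pid in (1234, 32767, 32768, 40000, 65534):
        print(pid, "stored as", as_signed_short(pid))
```

&lt;P&gt;For example, a PID of 40000 comes back as -25536, while anything up to 32767 is unchanged.&lt;/P&gt;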

&lt;P&gt;- Bryan&lt;/P&gt;</description>
      <pubDate>Tue, 04 Oct 2016 14:20:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079350#M4853</guid>
      <dc:creator>Bryan_C_1</dc:creator>
      <dc:date>2016-10-04T14:20:44Z</dc:date>
    </item>
    <item>
      <title>Hi Bryan,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079351#M4854</link>
      <description>&lt;P&gt;Hi Bryan,&lt;/P&gt;

&lt;P&gt;The Intel MPI library uses signed and unsigned integers (both 32-bit) to handle process IDs.&lt;BR /&gt;
	Regarding the DAPL library: I just checked with one of our developers, and it seems that there are also at least 16-bit unsigned values being used.&lt;/P&gt;
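
&lt;P&gt;As a quick sanity check of those widths against a given kernel.pid_max, the sketch below lists which integer types can hold the highest possible PID unchanged. The function name widths_that_fit is hypothetical, and the check says nothing about which fields the libraries actually use.&lt;/P&gt;

```python
import ctypes

WIDTHS = {
    "signed 16-bit": ctypes.c_int16,
    "unsigned 16-bit": ctypes.c_uint16,
    "signed 32-bit": ctypes.c_int32,
    "unsigned 32-bit": ctypes.c_uint32,
}


def widths_that_fit(pid_max):
    """Return names of integer types that hold the highest PID unchanged."""
    highest_pid = pid_max - 1  # the kernel hands out PIDs below pid_max
    return [name for name, ctype in WIDTHS.items()
            if ctype(highest_pid).value == highest_pid]


if __name__ == "__main__":
    # Standard Linux location of the kernel.pid_max sysctl.
    with open("/proc/sys/kernel/pid_max") as handle:
        print(widths_that_fit(int(handle.read())))
```

&lt;P&gt;With pid_max at 65535, an unsigned 16-bit value still fits the highest PID; only a signed 16-bit value would not.&lt;/P&gt;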

&lt;P&gt;Best regards,&lt;BR /&gt;
	Michael&lt;/P&gt;</description>
      <pubDate>Wed, 05 Oct 2016 13:10:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079351#M4854</guid>
      <dc:creator>Michael_Intel</dc:creator>
      <dc:date>2016-10-05T13:10:58Z</dc:date>
    </item>
    <item>
      <title>Hmmm.. Guess it might be at</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079352#M4855</link>
      <description>&lt;P&gt;Hmmm.. Guess it might be at another level.&amp;nbsp; Maybe the OFED libraries involved.&amp;nbsp; I think we've solved the issue, just not sure where it actually lies.&lt;/P&gt;

&lt;P&gt;Thanks for checking and the support!&lt;/P&gt;

&lt;P&gt;- Bryan&lt;/P&gt;</description>
      <pubDate>Wed, 05 Oct 2016 14:49:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-intermittent-failures/m-p/1079352#M4855</guid>
      <dc:creator>Bryan_C_1</dc:creator>
      <dc:date>2016-10-05T14:49:45Z</dc:date>
    </item>
  </channel>
</rss>

