<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Problems with IB and IntelMPI in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Problems-with-IB-and-IntelMPI/m-p/823287#M1202</link>
    <description>&lt;P&gt;Hi ivcores,&lt;/P&gt;&lt;P&gt;Unfortunately, that's a pretty generic error. All it means is that one of your MPI processes failed, and that took down the entire application run.&lt;/P&gt;&lt;P&gt;This could be caused by the use of an older provider (similar to a &lt;A target="_blank" href="http://software.intel.com/en-us/forums/showthread.php?t=74020"&gt;different issue on this forum&lt;/A&gt;). Could I see the output of your &lt;B&gt;ibstat&lt;/B&gt; tool? Also, what version of OFED do you have installed (running &lt;B&gt;ofed_info&lt;/B&gt; should tell you)? We recommend the latest &lt;A target="_blank" href="http://www.openfabrics.org/downloads/OFED/ofed-1.5.1/"&gt;OFED 1.5.1&lt;/A&gt;, if possible.&lt;/P&gt;&lt;P&gt;Regards,&lt;BR /&gt;~Gergana&lt;/P&gt;</description>
    <pubDate>Tue, 11 May 2010 21:14:19 GMT</pubDate>
    <dc:creator>Gergana_S_Intel</dc:creator>
    <dc:date>2010-05-11T21:14:19Z</dc:date>
    <item>
      <title>Problems with IB and IntelMPI</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Problems-with-IB-and-IntelMPI/m-p/823284#M1199</link>
      <description>&lt;PRE&gt;Dear HPC Forum.&lt;BR /&gt;I'm working with Intel MPI on a Linux cluster with an InfiniBand network, and I'm&lt;BR /&gt;having problems when I use more than 16 processes.&lt;BR /&gt;&lt;BR /&gt;I execute with the following command:&lt;BR /&gt;&lt;BR /&gt;/home/apps/intel/impi/4.0.0.017/bin64/mpirun  -r ssh -f mpd.hosts -n 36 -env I_MPI_DEBUG 5 bt.B.36&lt;BR /&gt;&lt;BR /&gt;but I get the following error:&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;[34] MPI startup(): DAPL provider OpenIB-cma&lt;BR /&gt;[33] MPI startup(): DAPL provider OpenIB-cma&lt;BR /&gt;...&lt;BR /&gt;[11] MPI startup(): shm and dapl data transfer modes&lt;BR /&gt;[28] MPI startup(): shm and dapl data transfer modes&lt;BR /&gt;...&lt;BR /&gt;[0] MPI startup(): static connections storm algo&lt;BR /&gt;compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32304&lt;BR /&gt;compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32305&lt;BR /&gt;compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32303&lt;BR /&gt;compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32306&lt;BR /&gt;[0:compute-0-3] unexpected DAPL event 0x4008&lt;BR /&gt;Assertion failed in file ../../dapl_module_init.c at line 3896: 0&lt;BR /&gt;internal ABORT - process 0&lt;BR /&gt;...&lt;BR /&gt;compute-0-2.local:27022: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.9,26580&lt;BR /&gt;rank 31 in job 1  compute-0-0.local_47979   caused collective abort of all ranks&lt;BR /&gt;  exit status of rank 31: return code 1 &lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;My /etc/dat.conf is:&lt;BR /&gt;&lt;BR /&gt;OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""&lt;BR /&gt;OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""&lt;BR /&gt;OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 
1" ""&lt;BR /&gt;OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""&lt;BR /&gt;OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""&lt;BR /&gt;OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""&lt;BR /&gt;OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""&lt;BR /&gt;OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""&lt;BR /&gt;OpenIB-ehca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""&lt;BR /&gt;OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""&lt;BR /&gt;ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""&lt;BR /&gt;ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""&lt;BR /&gt;ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""&lt;BR /&gt;ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""&lt;BR /&gt;ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""&lt;BR /&gt;ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""&lt;BR /&gt;ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""&lt;BR /&gt;ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""&lt;BR /&gt;ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""&lt;BR /&gt;ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""&lt;BR /&gt;ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""&lt;BR /&gt;ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""&lt;BR /&gt;ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""&lt;BR /&gt;ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""&lt;BR /&gt;&lt;BR /&gt;and I solved this problem with 
these settings:&lt;BR /&gt;&lt;BR /&gt;/home/apps/intel/impi/4.0.0.017/bin64/mpirun  -r ssh -f mpd.hosts -n 36 -env I_MPI_FABRICS_LIST "ofa,dapl,tcp" -env I_MPI_DEBUG 5 bt.B.36&lt;BR /&gt;&lt;BR /&gt;In this case the output is:&lt;BR /&gt;&lt;BR /&gt;[0] MPI startup(): shm and ofa data transfer modes&lt;BR /&gt;...&lt;BR /&gt;[35] MPI startup(): shm and ofa data transfer modes&lt;BR /&gt;[0] MPI startup(): I_MPI_DEBUG=5&lt;BR /&gt;[0] MPI startup(): I_MPI_FABRICS_LIST=ofa,dapl,tcp&lt;BR /&gt;[1] MPI Startup(): set domain to {3,11} on node compute-0-5.local&lt;BR /&gt;...&lt;BR /&gt;[31] MPI Startup(): set domain to {6,14} on node compute-0-7.local&lt;BR /&gt;[0] Rank    Pid      Node name          Pin cpu&lt;BR /&gt;[0] 0       32347    compute-0-5.local  {1,9}&lt;BR /&gt;...&lt;BR /&gt;[0] 35      29342    compute-0-1.local  {4,6,12,14}&lt;BR /&gt;rank 15 in job 1  compute-0-0.local_37285   caused collective abort of all ranks&lt;BR /&gt;  exit status of rank 15: killed by signal 9 &lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;I also tried executing with the flag -env I_MPI_USE_DYNAMIC_CONNECTIONS 0, but the result is the same.&lt;BR /&gt;&lt;BR /&gt;Finally, the output of ibv_devinfo is: &lt;BR /&gt;&lt;BR /&gt;hca_id:	qib0&lt;BR /&gt;	transport:			InfiniBand (0)&lt;BR /&gt;	fw_ver:				0.0.0&lt;BR /&gt;	node_guid:			0011:7500:00ff:8829&lt;BR /&gt;	sys_image_guid:			0011:7500:00ff:8829&lt;BR /&gt;	vendor_id:			0x1175&lt;BR /&gt;	vendor_part_id:			29216&lt;BR /&gt;	hw_ver:				0x2&lt;BR /&gt;	board_id:			InfiniPath_QLE7240&lt;BR /&gt;	phys_port_cnt:			1&lt;BR /&gt;		port:	1&lt;BR /&gt;			state:			PORT_ACTIVE (4)&lt;BR /&gt;			max_mtu:		4096 (5)&lt;BR /&gt;			active_mtu:		2048 (4)&lt;BR /&gt;			sm_lid:			9&lt;BR /&gt;			port_lid:		8&lt;BR /&gt;			port_lmc:		0x00&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;With OpenMPI it works correctly. Do you have any idea what the problem is?&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;BR /&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 11 May 2010 11:28:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Problems-with-IB-and-IntelMPI/m-p/823284#M1199</guid>
      <dc:creator>ivcores</dc:creator>
      <dc:date>2010-05-11T11:28:01Z</dc:date>
    </item>
    <item>
      <title>Problems with IB and IntelMPI</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Problems-with-IB-and-IntelMPI/m-p/823285#M1200</link>
      <description>&lt;DIV id="_mcePaste"&gt;I found the following information while searching the web for your error message. Hope it helps.&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;-- Andres&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;Workaround in Case of IntelMPI/uDAPL Error "unexpected DAPL event 4008"&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;The following error may occur on rare occasions with IntelMPI/uDAPL: "unexpected DAPL event 4008 from ..."&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;To work around this, add the following to your mpirun command:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;-genv I_MPI_USE_DYNAMIC_CONNECTIONS 0&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;This problem is caused by a limitation in Intel MPI/uDAPL's dynamic connection mechanism when MPI&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;processes are not sufficiently attentive to incoming interconnect traffic.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 11 May 2010 14:01:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Problems-with-IB-and-IntelMPI/m-p/823285#M1200</guid>
      <dc:creator>Andres_M_Intel4</dc:creator>
      <dc:date>2010-05-11T14:01:03Z</dc:date>
    </item>
    <item>
      <title>Problems with IB and IntelMPI</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Problems-with-IB-and-IntelMPI/m-p/823286#M1201</link>
      <description>&lt;BR /&gt;Hi Andres,&lt;BR /&gt;Thanks for your reply. Unfortunately, the problem persists. I executed the command&lt;BR /&gt;&lt;BR /&gt;/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -genv I_MPI_USE_DYNAMIC_CONNECTIONS 0 -n 36 -env I_MPI_FABRICS_LIST "ofa,dapl,tcp" -env I_MPI_DEBUG 5 bt.B.36&lt;BR /&gt;&lt;BR /&gt;and the error is the same:&lt;BR /&gt;&lt;BR /&gt;...&lt;BR /&gt;[0] 34 31432 compute-0-1.local {0,2,8,10}&lt;BR /&gt;[0] 35 31433 compute-0-1.local {4,6,12,14}&lt;BR /&gt;rank 28 in job 1 compute-0-0.local_48003 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 28: killed by signal 9 &lt;BR /&gt; &lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 11 May 2010 14:34:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Problems-with-IB-and-IntelMPI/m-p/823286#M1201</guid>
      <dc:creator>ivcores</dc:creator>
      <dc:date>2010-05-11T14:34:55Z</dc:date>
    </item>
    <item>
      <title>Problems with IB and IntelMPI</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Problems-with-IB-and-IntelMPI/m-p/823287#M1202</link>
      <description>&lt;P&gt;Hi ivcores,&lt;/P&gt;&lt;P&gt;Unfortunately, that's a pretty generic error. All it means is that one of your MPI processes failed, and that took down the entire application run.&lt;/P&gt;&lt;P&gt;This could be caused by the use of an older provider (similar to a &lt;A target="_blank" href="http://software.intel.com/en-us/forums/showthread.php?t=74020"&gt;different issue on this forum&lt;/A&gt;). Could I see the output of your &lt;B&gt;ibstat&lt;/B&gt; tool? Also, what version of OFED do you have installed (running &lt;B&gt;ofed_info&lt;/B&gt; should tell you)? We recommend the latest &lt;A target="_blank" href="http://www.openfabrics.org/downloads/OFED/ofed-1.5.1/"&gt;OFED 1.5.1&lt;/A&gt;, if possible.&lt;/P&gt;&lt;P&gt;Regards,&lt;BR /&gt;~Gergana&lt;/P&gt;</description>
      <pubDate>Tue, 11 May 2010 21:14:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Problems-with-IB-and-IntelMPI/m-p/823287#M1202</guid>
      <dc:creator>Gergana_S_Intel</dc:creator>
      <dc:date>2010-05-11T21:14:19Z</dc:date>
    </item>
  </channel>
</rss>

