<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hello Mark, in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075681#M4771</link>
    <description>&lt;P&gt;Hello Mark,&lt;/P&gt;

&lt;P&gt;Thanks for the suggestions. I didn't get this to work with srun, with or without the hydra_nameserver. When I replace srun with mpiexec.hydra, it can connect to processes running in the same job or in another job. Here's a job script that connects MPI processes in two separate SLURM jobs:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;#!/bin/bash
#
# This job script:
#  -starts a process that opens a port and prints it
#  -submits itself with the word 'connect' and the port as arguments
#  -the second job opens the port
#  -the processes communicate a bit and stop
#
#SBATCH -N 1
#SBATCH -n 1

export I_MPI_DEBUG=2
export I_MPI_PMI_LIBRARY=none

tmp=$(mktemp)
echo "output accept_c: $(readlink -f $tmp)"

if [ "$1" != "connect" ]; then
  sleep 5
  mpiexec.hydra -bootstrap srun -n 1 ./accept_c 2&amp;gt;&amp;amp;1 | tee $tmp &amp;amp;
  until [ "$port" != "" ];do 
    port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
    echo "Found port: $port"
    sleep 1
  done
  sbatch $0 connect $port
else 
  sleep 3
  mpiexec.hydra -bootstrap srun -n 1 ./connect_c $2 &amp;amp;
fi

wait&lt;/PRE&gt;

&lt;P&gt;This works fine over InfiniBand, which is good.&lt;/P&gt;

&lt;P&gt;However, there are still some issues that I can't resolve after many tests:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;I have to set I_MPI_PMI_LIBRARY to a nonexistent file; otherwise it starts multiple tasks, each being rank 0 of its own separate MPI_COMM_WORLD. You can test this by increasing the number of tasks for connect_c in the above job. I'm not sure why that happens (maybe this issue only exists on our cluster?).&lt;/LI&gt;
	&lt;LI&gt;The tasks are distributed across nodes, but within a node all tasks are bound to core 0. That is of course not good, since our nodes each have 24 cores.&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Tue, 06 Sep 2016 15:27:47 GMT</pubDate>
    <dc:creator>John_D_6</dc:creator>
    <dc:date>2016-09-06T15:27:47Z</dc:date>
    <item>
      <title>SLURM 14.11.9 with MPI_Comm_accept causes Assertion failed when communicating</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075677#M4767</link>
      <description>&lt;P&gt;Dear all,&lt;/P&gt;

&lt;P&gt;I'd like to run different MPI processes in a server/client setup, as explained in:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/using-the-intel-mpi-library-in-a-serverclient-setup"&gt;https://software.intel.com/en-us/articles/using-the-intel-mpi-library-in-a-serverclient-setup&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;I have attached the two programs that I used as test:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;accept.c opens a port and calls MPI_Comm_accept&lt;/LI&gt;
	&lt;LI&gt;connect.c calls MPI_Comm_connect and expects the port as an argument&lt;/LI&gt;
	&lt;LI&gt;once the connection is set up, MPI_Allreduce is used to sum some integers&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;Everything works fine when I start these programs interactively with mpirun:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;[donners@int2 openport]$ mpirun -n 1 ./accept_c &amp;amp;
[3] 9575
[donners@int2 openport]$ ./accept_c: MPI_Open_port..
./accept_c: mpiport=tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./accept_c: MPI_Comm_Accept..


[3]+  Stopped                 mpirun -n 1 ./accept_c
[donners@int2 openport]$ mpirun -n 1 ./connect_c 'tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$'
./connect_c: Port name entered: tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./connect_c: MPI_Comm_connect..
./connect_c: Size of intercommunicator: 1
./connect_c: intercomm, MPI_Allreduce..
./accept_c: Size of intercommunicator: 1
./accept_c: intercomm, MPI_Allreduce..
./connect_c: intercomm, my_value=7 SUM=8
./accept_c: intercomm, my_value=8 SUM=7
./accept_c: intracomm, MPI_Allreduce..
./accept_c: intracomm, my_value=8 SUM=15
Done
./accept_c: Done
./connect_c: intracomm, MPI_Allreduce..
./connect_c: intracomm, my_value=7 SUM=15
Done&lt;/PRE&gt;

&lt;P&gt;However, it fails when started by SLURM. The job script looks like:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;#!/bin/bash
#SBATCH -n 2

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
tmp=$(mktemp)
srun -l -n 1 ./accept_c 2&amp;gt;&amp;amp;1 | tee $tmp &amp;amp;
until [ "$port" != "" ];do
  port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
  echo "Found port: $port"
  sleep 1
done
srun -l -n 1 ./connect_c "$port" &amp;lt;&amp;lt;EOF
$port
EOF&lt;/PRE&gt;

&lt;P&gt;The output is:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;Found port: 
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Open_port..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: mpiport=tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Comm_Accept..
Found port: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
output connect_c: /scratch/nodespecific/srv4/donners.2300217/tmp.rQY9kHs8HS
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Port name entered: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: MPI_Comm_connect..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr &amp;amp;&amp;amp; ptr == (char*) MPIDI_Process.my_pg-&amp;gt;id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.0 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr &amp;amp;&amp;amp; ptr == (char*) MPIDI_Process.my_pg-&amp;gt;id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.1 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***
&lt;/PRE&gt;

&lt;P&gt;The server and client do connect, but they fail as soon as communication starts. This looks like a bug in the MPI library.&lt;BR /&gt;
	Could you let me know if this is the case, or whether this usage is simply not supported by Intel MPI?&lt;/P&gt;
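&lt;P&gt;In case it helps the diagnosis: the assertion fires inside the DAPL netmod, so forcing the TCP fabric should bypass that code path entirely. I can't try it myself at the moment, so this is only a guess (that shm:tcp is an available fallback on our system is an assumption):&lt;/P&gt;

```shell
# Force shared memory + TCP instead of DAPL before launching the jobs,
# to test whether the failure is specific to the DAPL netmod.
# (Assumption: shm:tcp is usable on this cluster.)
export I_MPI_FABRICS=shm:tcp
export I_MPI_FALLBACK=0   # fail loudly instead of silently falling back
```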

&lt;P&gt;With regards,&lt;BR /&gt;
	John&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jul 2016 12:02:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075677#M4767</guid>
      <dc:creator>John_D_6</dc:creator>
      <dc:date>2016-07-28T12:02:39Z</dc:date>
    </item>
    <item>
      <title>Hello John,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075678#M4768</link>
      <description>&lt;P&gt;Hello John,&lt;/P&gt;

&lt;P&gt;I first checked using an interactive session, and it worked as you described. The sbatch route worked too. I added simple parsing of the nodelist to make sure that the client and server run on different nodes.&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;#!/bin/bash
#queue or -p
#SBATCH --partition=debug
#SBATCH --nodes=2

#requested time
#SBATCH --time=00:10:00

#jobname or -J
#SBATCH --job-name=server

export I_MPI_DEBUG=5
export NODELIST=nodelist.$$

srun -l bash -c 'hostname' | sort | awk '{print $2}' &amp;gt; $NODELIST

echo "nodelist"
cat $NODELIST

n=0
for i in `cat ${NODELIST} | tr '.' '\n'` ; do
    mynodes[$n]=${i}
    let n=$n+1
done

echo "node=${mynodes[0]}"
echo "node=${mynodes[1]}"

export I_MPI_PMI_LIBRARY=/usr/lib64/slurmpmi/libpmi.so
tmp=$(mktemp)

srun -v -l --nodes=1-1 -n 1 -w ${mynodes[0]} ./accept_c 2&amp;gt;&amp;amp;1 | tee $tmp &amp;amp;

until [ "$port" != "" ];do
  port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
  echo "Found port: $port"
  sleep 1
done

srun -v -l --nodes=1-1 -n 1 -w ${mynodes[1]} ./connect_c "$port" &amp;lt;&amp;lt;EOF
$port
EOF&lt;/PRE&gt;

&lt;P&gt;and the output:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;SLURM_JOBID=2802169
SLURM_JOB_NODELIST=nid00[148-149]
SLURM_NNODES=2
SLURMTMPDIR=
working directory = /global/u2/m/mlubin/IDZ
nodelist
nid00148
nid00149
node=nid00148
node=nid00149
Found port:
srun: defined options for program `srun'

....

srun: launching 2802169.1 on host nid00148, 1 tasks: 0
srun: route default plugin loaded
srun: Node nid00148, 1 tasks started
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
0: [0] MPI startup(): Multi-threaded optimized library
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
0: [0] MPI startup(): shm and tcp data transfer modes
0: [0] MPI startup(): Rank    Pid      Node name  Pin cpu
0: [0] MPI startup(): 0       14543    nid00048   +1
0: [0] MPI startup(): I_MPI_DEBUG=5
0: [0] MPI startup(): I_MPI_FABRICS=shm:tcp
0: [0] MPI startup(): I_MPI_FALLBACK=1

0: /global/u2/m/mlubin/IDZ/./accept_c: MPI_Open_port..
0: /global/u2/m/mlubin/IDZ/./accept_c: mpiport=tag#0$description#nid00148$port#65$ifname#10.128.0.149$
0: /global/u2/m/mlubin/IDZ/./accept_c: MPI_Comm_Accept..
Found port: tag#0$description#nid00148$port#65$ifname#10.128.0.149$
/var/spool/slurmd/job2802169/slurm_script: line 54: warning: here-document at line 53 delimited by end-of-file (wanted `EOF')
srun: defined options for program `srun'

....

srun: remote command    : `./connect_c tag#0$description#nid00148$port#65$ifname#10.128.0.149$'
srun: Consumable Resources (CR) Node Selection plugin loaded with argument 50
srun: launching 2802169.2 on host nid00149, 1 tasks: 0
srun: route default plugin loaded
srun: Node nid00149, 1 tasks started
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
0: [0] MPI startup(): Multi-threaded optimized library
0: [0] MPI startup(): shm and tcp data transfer modes
0: [0] MPI startup(): Rank    Pid      Node name  Pin cpu
0: [0] MPI startup(): 0       20633    nid00049   +1
0: [0] MPI startup(): I_MPI_DEBUG=5
0: [0] MPI startup(): I_MPI_FABRICS=shm:tcp
0: [0] MPI startup(): I_MPI_FALLBACK=1

0: /global/u2/m/mlubin/IDZ/./connect_c: Port name entered: tag#0$description#nid00148$port#65$ifname#10.128.0.149$
0: 0: /global/u2/m/mlubin/IDZ/./accept_c: Size of intercommunicator: 1
0: /global/u2/m/mlubin/IDZ/./accept_c: intercomm, MPI_Allreduce..
/global/u2/m/mlubin/IDZ/./connect_c: MPI_Comm_connect..
0: /global/u2/m/mlubin/IDZ/./accept_c: intercomm, my_value=8 SUM=7
0: /global/u2/m/mlubin/IDZ/./accept_c: intracomm, MPI_Allreduce..
0: /global/u2/m/mlubin/IDZ/./accept_c: intracomm, my_value=8 SUM=15
0: Done
0: /global/u2/m/mlubin/IDZ/./accept_c: Done
0: /global/u2/m/mlubin/IDZ/./connect_c: Size of intercommunicator: 1
0: /global/u2/m/mlubin/IDZ/./connect_c: intercomm, MPI_Allreduce..
0: /global/u2/m/mlubin/IDZ/./connect_c: intercomm, my_value=7 SUM=8
0: /global/u2/m/mlubin/IDZ/./connect_c: intracomm, MPI_Allreduce..
0: /global/u2/m/mlubin/IDZ/./connect_c: intracomm, my_value=7 SUM=15
0: Done
0:
srun: Received task exit notification for 1 task (status=0x0000).
srun: nid00149: task 0: Completed
&lt;/PRE&gt;

&lt;P&gt;This was done using&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;ldd connect_c
        linux-vdso.so.1 (0x00002aaaaaaab000)
        libmpifort.so.12 =&amp;gt; /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/lib/libmpifort.so.12 (0x00002aaaaaaaf000)
        libmpi.so.12 =&amp;gt; /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/lib/libmpi.so.12 (0x00002aaaaae4d000)
        libdl.so.2 =&amp;gt; /lib64/libdl.so.2 (0x00002aaaab673000)
        librt.so.1 =&amp;gt; /lib64/librt.so.1 (0x00002aaaab878000)
        libpthread.so.0 =&amp;gt; /lib64/libpthread.so.0 (0x00002aaaaba80000)
        libm.so.6 =&amp;gt; /lib64/libm.so.6 (0x00002aaaabc9d000)
        libgcc_s.so.1 =&amp;gt; /lib64/libgcc_s.so.1 (0x00002aaaabf9f000)
        libc.so.6 =&amp;gt; /lib64/libc.so.6 (0x00002aaaac1b6000)
        /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
&lt;/PRE&gt;

&lt;P&gt;It looks like one of the nodes does not clean up after itself, though I'm not sure why.&lt;/P&gt;

&lt;P&gt;Mark&lt;/P&gt;

</description>
      <pubDate>Tue, 02 Aug 2016 03:10:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075678#M4768</guid>
      <dc:creator>Mark_L_Intel</dc:creator>
      <dc:date>2016-08-02T03:10:49Z</dc:date>
    </item>
    <item>
      <title>Dear Mark,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075679#M4769</link>
      <description>&lt;P&gt;Dear Mark,&lt;/P&gt;

&lt;P&gt;Thank you for checking the test on a different system. It's good to see that this setup can work in combination with SLURM. I see from your output that I_MPI_FABRICS is set to shm:tcp, so it doesn't use the InfiniBand network. The MPI library fails on our system in DAPL, so maybe the problem is specific to InfiniBand? Unfortunately, I can't check whether tcp works on our system, since it's down for the coming two weeks.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Aug 2016 09:11:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075679#M4769</guid>
      <dc:creator>John_D_6</dc:creator>
      <dc:date>2016-08-02T09:11:54Z</dc:date>
    </item>
    <item>
      <title>Hello John,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075680#M4770</link>
      <description>&lt;P&gt;Hello John,&lt;/P&gt;

&lt;P&gt;Indeed, this simple code may not work over IB. If you need faster communication and would like to use the same code example, you can set up IPoIB:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;export I_MPI_FABRICS=shm:tcp
export I_MPI_TCP_NETMASK=ib&lt;/PRE&gt;

&lt;P&gt;I just checked and it worked. You won't get the same speeds as with real IB, but it should be faster than plain TCP over Ethernet.&lt;/P&gt;

&lt;P&gt;In case you want "real" IB, you may need to use MPI_Publish_name/MPI_Lookup_name (hydra_nameserver needs to be started):&lt;/P&gt;

&lt;P&gt;&lt;A href="https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager" target="_blank"&gt;https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;I did not try it myself, but I see some references:&lt;/P&gt;

&lt;P&gt;&lt;A href="http://stackoverflow.com/questions/14210558/mpich-how-to-publish-name-such-that-a-client-application-can-lookup-name-it"&gt;http://stackoverflow.com/questions/14210558/mpich-how-to-publish-name-such-that-a-client-application-can-lookup-name-it&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;&lt;A href="http://mpi.deino.net/mpi_functions/MPI_Lookup_name.html"&gt;http://mpi.deino.net/mpi_functions/MPI_Lookup_name.html&lt;/A&gt;&lt;/P&gt;
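&lt;P&gt;The publish/lookup route would be driven roughly as follows. I have not tried this myself: login01 is a placeholder for a stable host, and server_c/client_c stand for versions of accept_c/connect_c extended with MPI_Publish_name and MPI_Lookup_name calls.&lt;/P&gt;

```shell
# Sketch only -- the host name and program names are placeholders.
# 1. Start the name server on a stable host (e.g. a login node);
#    it stays in the foreground, so run it in its own shell.
hydra_nameserver

# 2. Server job: the program calls MPI_Publish_name after MPI_Open_port,
#    registering its port string under an agreed service name.
#    -nameserver tells hydra where the name server is running.
mpiexec.hydra -nameserver login01 -n 1 ./server_c

# 3. Client job: the program calls MPI_Lookup_name with the same
#    service name instead of taking the port on the command line.
mpiexec.hydra -nameserver login01 -n 1 ./client_c
```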

&lt;P&gt;In case you try it and succeed, I would be interested to see the final code. The complication with SLURM and a third-party PMI (which we do recommend in the SLURM case) is that it might introduce additional issues with hydra_nameserver, or it might not; this approach needs more experiments.&lt;/P&gt;

&lt;P&gt;Thanks,&lt;/P&gt;

&lt;P&gt;Mark&lt;/P&gt;

</description>
      <pubDate>Tue, 02 Aug 2016 18:03:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075680#M4770</guid>
      <dc:creator>Mark_L_Intel</dc:creator>
      <dc:date>2016-08-02T18:03:21Z</dc:date>
    </item>
    <item>
      <title>Hello Mark,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075681#M4771</link>
      <description>&lt;P&gt;Hello Mark,&lt;/P&gt;

&lt;P&gt;Thanks for the suggestions. I didn't get this to work with srun, with or without the hydra_nameserver. When I replace srun with mpiexec.hydra, it can connect to processes running in the same job or in another job. Here's a job script that connects MPI processes in two separate SLURM jobs:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;#!/bin/bash
#
# This job script:
#  -starts a process that opens a port and prints it
#  -submits itself with the word 'connect' and the port as arguments
#  -the second job opens the port
#  -the processes communicate a bit and stop
#
#SBATCH -N 1
#SBATCH -n 1

export I_MPI_DEBUG=2
export I_MPI_PMI_LIBRARY=none

tmp=$(mktemp)
echo "output accept_c: $(readlink -f $tmp)"

if [ "$1" != "connect" ]; then
  sleep 5
  mpiexec.hydra -bootstrap srun -n 1 ./accept_c 2&amp;gt;&amp;amp;1 | tee $tmp &amp;amp;
  until [ "$port" != "" ];do 
    port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
    echo "Found port: $port"
    sleep 1
  done
  sbatch $0 connect $port
else 
  sleep 3
  mpiexec.hydra -bootstrap srun -n 1 ./connect_c $2 &amp;amp;
fi

wait&lt;/PRE&gt;

&lt;P&gt;This works fine over InfiniBand, which is good.&lt;/P&gt;
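&lt;P&gt;One caveat: the until-loop in the script above busy-waits and will spin forever if accept_c dies before printing its port. A small helper with a timeout is safer. This is just a sketch: wait_for_port is a made-up name here; the "mpiport=" marker matches what accept_c prints.&lt;/P&gt;

```shell
# Wait until a line containing "mpiport=" shows up in a log file,
# or give up after a timeout. Prints the port string on stdout.
# Sketch only: wait_for_port is a hypothetical helper name.
wait_for_port() {
  local log=$1 timeout=${2:-60} waited=0 port=
  while [ "$waited" -lt "$timeout" ]; do
    # -s: no error if the file is still empty; -m1: stop at first match
    port=$(grep -s -m1 'mpiport=' "$log" | cut -d= -f2-)
    if [ -n "$port" ]; then
      echo "$port"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1  # timed out without ever seeing a port line
}
```

&lt;P&gt;Used in place of the until-loop: port=$(wait_for_port "$tmp" 120) || exit 1&lt;/P&gt;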

&lt;P&gt;However, there are still some issues that I can't resolve after many tests:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;I have to set I_MPI_PMI_LIBRARY to a nonexistent file; otherwise it starts multiple tasks, each being rank 0 of its own separate MPI_COMM_WORLD. You can test this by increasing the number of tasks for connect_c in the above job. I'm not sure why that happens (maybe this issue only exists on our cluster?).&lt;/LI&gt;
	&lt;LI&gt;The tasks are distributed across nodes, but within a node all tasks are bound to core 0. That is of course not good, since our nodes each have 24 cores.&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 06 Sep 2016 15:27:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/SLURM-14-11-9-with-MPI-Comm-accept-causes-Assertion-failed-when/m-p/1075681#M4771</guid>
      <dc:creator>John_D_6</dc:creator>
      <dc:date>2016-09-06T15:27:47Z</dc:date>
    </item>
  </channel>
</rss>

