Dear all,
I'd like to run separate MPI processes in a server/client setup, as explained in:
https://software.intel.com/en-us/articles/using-the-intel-mpi-library-in-a-serverclient-setup
I have attached the two programs that I used as test:
- accept.c opens a port and calls MPI_Comm_accept
- connect.c calls MPI_Comm_connect and expects the port as argument
- once the connection is set up, MPI_Allreduce is used to sum some integers (a minimal sketch of such a pair follows below)
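Since the attached sources are not reproduced inline, here is a minimal sketch of what such an accept/connect pair can look like. It is not the attached code itself; the printed values, the MPI_Intercomm_merge step and the use of MPI_COMM_SELF are assumptions based on the output shown below:

```c
/* Hypothetical sketch of the accept/connect test pair (the attached sources may differ).
 * Run with no argument to act as the server, or pass the port name to act as the client. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm inter, intra;
    char port[MPI_MAX_PORT_NAME];
    int my_value, sum;

    MPI_Init(&argc, &argv);

    if (argc == 1) {                      /* server side (accept.c) */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("mpiport=%s\n", port);     /* the job script greps for this line */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        my_value = 8;
    } else {                              /* client side (connect.c): port name on the command line */
        snprintf(port, sizeof(port), "%s", argv[1]);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        my_value = 7;
    }

    /* On an intercommunicator the reduction is taken over the remote group,
     * so each side sees the other side's value. */
    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, inter);
    printf("intercomm, my_value=%d SUM=%d\n", my_value, sum);

    /* Merge both sides into one intracommunicator and reduce again. */
    MPI_Intercomm_merge(inter, argc != 1, &intra);
    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, intra);
    printf("intracomm, my_value=%d SUM=%d\n", my_value, sum);

    MPI_Comm_free(&intra);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

This matches the behaviour in the output below: the intercommunicator reduction prints the remote side's value (7 and 8), and the merged intracommunicator gives the total of 15.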
Everything works fine when I start these programs interactively with mpirun:
[donners@int2 openport]$ mpirun -n 1 ./accept_c &
[3] 9575
[donners@int2 openport]$ ./accept_c: MPI_Open_port..
./accept_c: mpiport=tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./accept_c: MPI_Comm_Accept..
[3]+ Stopped    mpirun -n 1 ./accept_c
[donners@int2 openport]$ mpirun -n 1 ./connect_c 'tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$'
./connect_c: Port name entered: tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./connect_c: MPI_Comm_connect..
./connect_c: Size of intercommunicator: 1
./connect_c: intercomm, MPI_Allreduce..
./accept_c: Size of intercommunicator: 1
./accept_c: intercomm, MPI_Allreduce..
./connect_c: intercomm, my_value=7 SUM=8
./accept_c: intercomm, my_value=8 SUM=7
./accept_c: intracomm, MPI_Allreduce..
./accept_c: intracomm, my_value=8 SUM=15
Done
./accept_c: Done
./connect_c: intracomm, MPI_Allreduce..
./connect_c: intracomm, my_value=7 SUM=15
Done
However, it fails when started by SLURM. The job script looks like:
#!/bin/bash
#SBATCH -n 2

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

tmp=$(mktemp)
srun -l -n 1 ./accept_c 2>&1 | tee $tmp &
until [ "$port" != "" ]; do
  port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
  echo "Found port: $port"
  sleep 1
done
srun -l -n 1 ./connect_c "$port" <<EOF
$port
EOF
The output is:
Found port:
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Open_port..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: mpiport=tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Comm_Accept..
Found port: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
output connect_c: /scratch/nodespecific/srv4/donners.2300217/tmp.rQY9kHs8HS
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Port name entered: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: MPI_Comm_connect..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr && ptr == (char*) MPIDI_Process.my_pg->id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.0 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr && ptr == (char*) MPIDI_Process.my_pg->id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.1 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***
The server and client do connect, but fail as soon as communication starts. This looks like a bug in the MPI library.
Could you let me know if this is the case, or if this use is not supported by Intel MPI?
With regards,
John
Hello John,
I first checked using an interactive session and it worked as you described. The sbatch job worked too. I added simple parsing of the node list to make sure that the client and server run on different nodes.
#!/bin/bash
#queue or -p
#SBATCH --partition=debug
#SBATCH --nodes=2
#requested time
#SBATCH --time=00:10:00
#jobname or -J
#SBATCH --job-name=server

export I_MPI_DEBUG=5
export NODELIST=nodelist.$$
srun -l bash -c 'hostname' | sort | awk '{print $2}' > $NODELIST
echo "nodelist"
cat $NODELIST
n=0
for i in `cat ${NODELIST} | tr '.' '\n'` ; do
  mynodes[$n]=${i}   # store each node name in an array
  let n=$n+1
done
echo "node=${mynodes[0]}"
echo "node=${mynodes[1]}"
export I_MPI_PMI_LIBRARY=/usr/lib64/slurmpmi/libpmi.so
tmp=$(mktemp)
srun -v -l --nodes=1-1 -n 1 -w ${mynodes[0]} ./accept_c 2>&1 | tee $tmp &
until [ "$port" != "" ]; do
  port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
  echo "Found port: $port"
  sleep 1
done
srun -v -l --nodes=1-1 -n 1 -w ${mynodes[1]} ./connect_c "$port" <<EOF
$port
and the output:
SLURM_JOBID=2802169
SLURM_JOB_NODELIST=nid00[148-149]
SLURM_NNODES=2
SLURMTMPDIR=
working directory = /global/u2/m/mlubin/IDZ
nodelist
nid00148
nid00149
node=nid00148
node=nid00149
Found port:
srun: defined options for program `srun'
....
srun: launching 2802169.1 on host nid00148, 1 tasks: 0
srun: route default plugin loaded
srun: Node nid00148, 1 tasks started
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
0: [0] MPI startup(): Multi-threaded optimized library
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
0: [0] MPI startup(): shm and tcp data transfer modes
0: [0] MPI startup(): Rank Pid Node name Pin cpu
0: [0] MPI startup(): 0 14543 nid00048 +1
0: [0] MPI startup(): I_MPI_DEBUG=5
0: [0] MPI startup(): I_MPI_FABRICS=shm:tcp
0: [0] MPI startup(): I_MPI_FALLBACK=1
0: /global/u2/m/mlubin/IDZ/./accept_c: MPI_Open_port..
0: /global/u2/m/mlubin/IDZ/./accept_c: mpiport=tag#0$description#nid00148$port#65$ifname#10.128.0.149$
0: /global/u2/m/mlubin/IDZ/./accept_c: MPI_Comm_Accept..
Found port: tag#0$description#nid00148$port#65$ifname#10.128.0.149$
/var/spool/slurmd/job2802169/slurm_script: line 54: warning: here-document at line 53 delimited by end-of-file (wanted `EOF')
srun: defined options for program `srun'
....
srun: remote command : `./connect_c tag#0$description#nid00148$port#65$ifname#10.128.0.149$'
srun: Consumable Resources (CR) Node Selection plugin loaded with argument 50
srun: launching 2802169.2 on host nid00149, 1 tasks: 0
srun: route default plugin loaded
srun: Node nid00149, 1 tasks started
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
0: [0] MPI startup(): Multi-threaded optimized library
0: [0] MPI startup(): shm and tcp data transfer modes
0: [0] MPI startup(): Rank Pid Node name Pin cpu
0: [0] MPI startup(): 0 20633 nid00049 +1
0: [0] MPI startup(): I_MPI_DEBUG=5
0: [0] MPI startup(): I_MPI_FABRICS=shm:tcp
0: [0] MPI startup(): I_MPI_FALLBACK=1
0: /global/u2/m/mlubin/IDZ/./connect_c: Port name entered: tag#0$description#nid00148$port#65$ifname#10.128.0.149$
0:
0: /global/u2/m/mlubin/IDZ/./accept_c: Size of intercommunicator: 1
0: /global/u2/m/mlubin/IDZ/./accept_c: intercomm, MPI_Allreduce..
/global/u2/m/mlubin/IDZ/./connect_c: MPI_Comm_connect..
0: /global/u2/m/mlubin/IDZ/./accept_c: intercomm, my_value=8 SUM=7
0: /global/u2/m/mlubin/IDZ/./accept_c: intracomm, MPI_Allreduce..
0: /global/u2/m/mlubin/IDZ/./accept_c: intracomm, my_value=8 SUM=15
0: Done
0: /global/u2/m/mlubin/IDZ/./accept_c: Done
0: /global/u2/m/mlubin/IDZ/./connect_c: Size of intercommunicator: 1
0: /global/u2/m/mlubin/IDZ/./connect_c: intercomm, MPI_Allreduce..
0: /global/u2/m/mlubin/IDZ/./connect_c: intercomm, my_value=7 SUM=8
0: /global/u2/m/mlubin/IDZ/./connect_c: intracomm, MPI_Allreduce..
0: /global/u2/m/mlubin/IDZ/./connect_c: intracomm, my_value=7 SUM=15
0: Done
0: srun: Received task exit notification for 1 task (status=0x0000).
srun: nid00149: task 0: Completed
This was done using
ldd connect_c
  linux-vdso.so.1 (0x00002aaaaaaab000)
  libmpifort.so.12 => /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/lib/libmpifort.so.12 (0x00002aaaaaaaf000)
  libmpi.so.12 => /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/lib/libmpi.so.12 (0x00002aaaaae4d000)
  libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab673000)
  librt.so.1 => /lib64/librt.so.1 (0x00002aaaab878000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaaba80000)
  libm.so.6 => /lib64/libm.so.6 (0x00002aaaabc9d000)
  libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002aaaabf9f000)
  libc.so.6 => /lib64/libc.so.6 (0x00002aaaac1b6000)
  /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
It looks like one of the nodes does not clean up after itself, though; I'm not sure why.
Mark
Dear Mark,
Thank you for checking the test on a different system. It's good to see that this setup can work in combination with SLURM. I see from your output that I_MPI_FABRICS is set to shm:tcp, so it doesn't use the InfiniBand network. The MPI library fails on our system in DAPL, so maybe the problem is specific to InfiniBand? Unfortunately, I can't check whether TCP works on our system, since it's down for the coming two weeks.
Hello John,
Indeed, this simple code may not work over IB. If you need faster communication and would like to keep the same code example, you can set up IPoIB:
export I_MPI_FABRICS=shm:tcp
export I_MPI_TCP_NETMASK=ib
I just checked and it worked. You won't get the same speed as with native IB, but it should be faster than TCP over Ethernet.
If you want "real" IB, you may need to use MPI_PUBLISH_NAME/MPI_LOOKUP_NAME (hydra_nameserver needs to be started):
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
I did not try it myself, but I see some references:
http://mpi.deino.net/mpi_functions/MPI_Lookup_name.html
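I have not run this myself either, but to make the idea concrete, here is an untested sketch of how the two sides could use the name service instead of passing the port string by hand. The service name "openport_test" is arbitrary, and hydra_nameserver has to be running somewhere both jobs can reach (e.g. pointed to via mpiexec.hydra's -nameserver option):

```c
/* Untested sketch: publish/lookup the port name via the MPI name service.
 * Assumes hydra_nameserver is running and reachable by both jobs.
 * Run with "server" as the first argument for the accept side;
 * the client no longer needs the port string on the command line. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("openport_test", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        /* ... MPI_Allreduce etc. as in accept.c ... */
        MPI_Unpublish_name("openport_test", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        MPI_Lookup_name("openport_test", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        /* ... MPI_Allreduce etc. as in connect.c ... */
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```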
If you try it and succeed, I would be interested to see the final code. The problem with SLURM and a third-party PMI (which we do recommend in the SLURM case) is that it might introduce additional complications with hydra_nameserver, or it might not; this approach needs more experimentation.
Thanks,
Mark
Hello Mark,
Thanks for the suggestions. I didn't get this to work with srun, with or without hydra_nameserver. When I replace srun with mpiexec.hydra, it can connect to processes run in the same job or in another job. Here is a job script that connects MPI processes in two separate SLURM jobs:
#!/bin/bash
#
# This job script:
# -starts a process that opens a port and prints it
# -submits itself with the word 'connect' and the port as arguments
# -the second job opens the port
# -the processes communicate a bit and stop
#
#SBATCH -N 1
#SBATCH -n 1

export I_MPI_DEBUG=2
export I_MPI_PMI_LIBRARY=none

tmp=$(mktemp)
echo "output accept_c: $(readlink -f $tmp)"
if [ "$1" != "connect" ]; then
  sleep 5
  mpiexec.hydra -bootstrap srun -n 1 ./accept_c 2>&1 | tee $tmp &
  until [ "$port" != "" ]; do
    port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
    echo "Found port: $port"
    sleep 1
  done
  sbatch $0 connect $port
else
  sleep 3
  mpiexec.hydra -bootstrap srun -n 1 ./connect_c $2 &
fi
wait
This works fine over InfiniBand, which is good.
However, there are still some issues that I can't resolve after many tests:
- I have to set I_MPI_PMI_LIBRARY to a non-existent file, otherwise it starts multiple tasks that are each rank 0 of a separate MPI_COMM_WORLD. You can test this by increasing the number of tasks for connect_c in the above job (a minimal check is sketched after this list). I'm not sure why that happens (maybe this issue only exists on our cluster?).
- The tasks are distributed across nodes, but within a node all tasks are bound to core 0. This is of course not good, since our nodes each have 24 cores.
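For the first point, a trivial way to see what happened is to have each task print its rank and the size of MPI_COMM_WORLD. This is just an illustrative snippet, not part of the attached tests:

```c
/* Hypothetical diagnostic: print the rank/size each task sees.
 * If every task reports "rank 0 of 1", the tasks were started as
 * independent MPI_COMM_WORLDs instead of one job with several ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d in MPI_COMM_WORLD\n", rank, size);
    MPI_Finalize();
    return 0;
}
```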