Dear all,
I'd like to run separate MPI processes in a server/client setup, as explained in:
https://software.intel.com/en-us/articles/using-the-intel-mpi-library-in-a-serverclient-setup
I have attached the two programs that I used as test:
- accept.c opens a port and calls MPI_Comm_accept
- connect.c calls MPI_Comm_connect and expects the port as argument
- once the connection is set up, MPI_Allreduce is used to sum some integers (a minimal sketch of such a pair follows below)
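Since the attached sources are not reproduced inline, here is a minimal sketch of what such an accept/connect pair can look like. It is not the attached code itself; the printed values, the MPI_Intercomm_merge step and the use of MPI_COMM_SELF are assumptions based on the output shown below:

```c
/* Hypothetical sketch of the accept/connect test pair (the attached sources may differ).
 * Run with no argument to act as the server, or pass the port name to act as the client. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm inter, intra;
    char port[MPI_MAX_PORT_NAME];
    int my_value, sum;

    MPI_Init(&argc, &argv);

    if (argc == 1) {                      /* server side (accept.c) */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("mpiport=%s\n", port);     /* the job script greps for this line */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        my_value = 8;
    } else {                              /* client side (connect.c): port name on the command line */
        snprintf(port, sizeof(port), "%s", argv[1]);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        my_value = 7;
    }

    /* On an intercommunicator the reduction is taken over the remote group,
     * so each side sees the other side's value. */
    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, inter);
    printf("intercomm, my_value=%d SUM=%d\n", my_value, sum);

    /* Merge both sides into one intracommunicator and reduce again. */
    MPI_Intercomm_merge(inter, argc != 1, &intra);
    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, intra);
    printf("intracomm, my_value=%d SUM=%d\n", my_value, sum);

    MPI_Comm_free(&intra);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

This matches the behaviour in the output below: the intercommunicator reduction prints the remote side's value (7 and 8), and the merged intracommunicator gives the total of 15.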
Everything works fine when I start these programs interactively with mpirun:
[donners@int2 openport]$ mpirun -n 1 ./accept_c &
[3] 9575
[donners@int2 openport]$ ./accept_c: MPI_Open_port..
./accept_c: mpiport=tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./accept_c: MPI_Comm_Accept..
[3]+ Stopped    mpirun -n 1 ./accept_c
[donners@int2 openport]$ mpirun -n 1 ./connect_c 'tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$'
./connect_c: Port name entered: tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./connect_c: MPI_Comm_connect..
./connect_c: Size of intercommunicator: 1
./connect_c: intercomm, MPI_Allreduce..
./accept_c: Size of intercommunicator: 1
./accept_c: intercomm, MPI_Allreduce..
./connect_c: intercomm, my_value=7 SUM=8
./accept_c: intercomm, my_value=8 SUM=7
./accept_c: intracomm, MPI_Allreduce..
./accept_c: intracomm, my_value=8 SUM=15
Done
./accept_c: Done
./connect_c: intracomm, MPI_Allreduce..
./connect_c: intracomm, my_value=7 SUM=15
Done
However, it fails when started by SLURM. The job script looks like:
#!/bin/bash
#SBATCH -n 2

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

tmp=$(mktemp)
srun -l -n 1 ./accept_c 2>&1 | tee $tmp &
until [ "$port" != "" ]; do
  port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
  echo "Found port: $port"
  sleep 1
done
srun -l -n 1 ./connect_c "$port" <<EOF
$port
EOF
The output is:
Found port:
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Open_port..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: mpiport=tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Comm_Accept..
Found port: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
output connect_c: /scratch/nodespecific/srv4/donners.2300217/tmp.rQY9kHs8HS
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Port name entered: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: MPI_Comm_connect..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr && ptr == (char*) MPIDI_Process.my_pg->id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.0 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr && ptr == (char*) MPIDI_Process.my_pg->id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.1 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***
The server and client do connect, but fail as soon as communication starts. This looks like a bug in the MPI library.
Could you let me know if this is the case, or if this use is not supported by Intel MPI?
With regards,
John
Hello John,
I first checked using an interactive session and it worked as you described. The sbatch job worked too. I added simple parsing of the node list to make sure that the client and server run on different nodes.
#!/bin/bash
#queue or -p
#SBATCH --partition=debug
#SBATCH --nodes=2
#requested time
#SBATCH --time=00:10:00
#jobname or -J
#SBATCH --job-name=server

export I_MPI_DEBUG=5
export NODELIST=nodelist.$$
srun -l bash -c 'hostname' | sort | awk '{print $2}' > $NODELIST
echo "nodelist"
cat $NODELIST
n=0
for i in `cat ${NODELIST} | tr '.' '\n'` ; do
  mynodes[$n]=${i}   # store each node name in an array
  let n=$n+1
done
echo "node=${mynodes[0]}"
echo "node=${mynodes[1]}"
export I_MPI_PMI_LIBRARY=/usr/lib64/slurmpmi/libpmi.so
tmp=$(mktemp)
srun -v -l --nodes=1-1 -n 1 -w ${mynodes[0]} ./accept_c 2>&1 | tee $tmp &
until [ "$port" != "" ]; do
  port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
  echo "Found port: $port"
  sleep 1
done
srun -v -l --nodes=1-1 -n 1 -w ${mynodes[1]} ./connect_c "$port" <<EOF
$port
and the output:
SLURM_JOBID=2802169
SLURM_JOB_NODELIST=nid00[148-149]
SLURM_NNODES=2
SLURMTMPDIR=
working directory = /global/u2/m/mlubin/IDZ
nodelist
nid00148
nid00149
node=nid00148
node=nid00149
Found port:
srun: defined options for program `srun'
....
srun: launching 2802169.1 on host nid00148, 1 tasks: 0
srun: route default plugin loaded
srun: Node nid00148, 1 tasks started
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
0: [0] MPI startup(): Multi-threaded optimized library
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
0: [0] MPI startup(): shm and tcp data transfer modes
0: [0] MPI startup(): Rank Pid Node name Pin cpu
0: [0] MPI startup(): 0 14543 nid00048 +1
0: [0] MPI startup(): I_MPI_DEBUG=5
0: [0] MPI startup(): I_MPI_FABRICS=shm:tcp
0: [0] MPI startup(): I_MPI_FALLBACK=1
0: /global/u2/m/mlubin/IDZ/./accept_c: MPI_Open_port..
0: /global/u2/m/mlubin/IDZ/./accept_c: mpiport=tag#0$description#nid00148$port#65$ifname#10.128.0.149$
0: /global/u2/m/mlubin/IDZ/./accept_c: MPI_Comm_Accept..
Found port: tag#0$description#nid00148$port#65$ifname#10.128.0.149$
/var/spool/slurmd/job2802169/slurm_script: line 54: warning: here-document at line 53 delimited by end-of-file (wanted `EOF')
srun: defined options for program `srun'
....
srun: remote command : `./connect_c tag#0$description#nid00148$port#65$ifname#10.128.0.149$'
srun: Consumable Resources (CR) Node Selection plugin loaded with argument 50
srun: launching 2802169.2 on host nid00149, 1 tasks: 0
srun: route default plugin loaded
srun: Node nid00149, 1 tasks started
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
srun: Sent KVS info to 1 nodes, up to 1 tasks per node
0: [0] MPI startup(): Multi-threaded optimized library
0: [0] MPI startup(): shm and tcp data transfer modes
0: [0] MPI startup(): Rank Pid Node name Pin cpu
0: [0] MPI startup(): 0 20633 nid00049 +1
0: [0] MPI startup(): I_MPI_DEBUG=5
0: [0] MPI startup(): I_MPI_FABRICS=shm:tcp
0: [0] MPI startup(): I_MPI_FALLBACK=1
0: /global/u2/m/mlubin/IDZ/./connect_c: Port name entered: tag#0$description#nid00148$port#65$ifname#10.128.0.149$
0:
0: /global/u2/m/mlubin/IDZ/./accept_c: Size of intercommunicator: 1
0: /global/u2/m/mlubin/IDZ/./accept_c: intercomm, MPI_Allreduce..
/global/u2/m/mlubin/IDZ/./connect_c: MPI_Comm_connect..
0: /global/u2/m/mlubin/IDZ/./accept_c: intercomm, my_value=8 SUM=7
0: /global/u2/m/mlubin/IDZ/./accept_c: intracomm, MPI_Allreduce..
0: /global/u2/m/mlubin/IDZ/./accept_c: intracomm, my_value=8 SUM=15
0: Done
0: /global/u2/m/mlubin/IDZ/./accept_c: Done
0: /global/u2/m/mlubin/IDZ/./connect_c: Size of intercommunicator: 1
0: /global/u2/m/mlubin/IDZ/./connect_c: intercomm, MPI_Allreduce..
0: /global/u2/m/mlubin/IDZ/./connect_c: intercomm, my_value=7 SUM=8
0: /global/u2/m/mlubin/IDZ/./connect_c: intracomm, MPI_Allreduce..
0: /global/u2/m/mlubin/IDZ/./connect_c: intracomm, my_value=7 SUM=15
0: Done
0: srun: Received task exit notification for 1 task (status=0x0000).
srun: nid00149: task 0: Completed
This was done using
ldd connect_c
  linux-vdso.so.1 (0x00002aaaaaaab000)
  libmpifort.so.12 => /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/lib/libmpifort.so.12 (0x00002aaaaaaaf000)
  libmpi.so.12 => /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/lib/libmpi.so.12 (0x00002aaaaae4d000)
  libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab673000)
  librt.so.1 => /lib64/librt.so.1 (0x00002aaaab878000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaaba80000)
  libm.so.6 => /lib64/libm.so.6 (0x00002aaaabc9d000)
  libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002aaaabf9f000)
  libc.so.6 => /lib64/libc.so.6 (0x00002aaaac1b6000)
  /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
It looks like one of the nodes does not clean up after itself, though; I'm not sure why.
Mark
Dear Mark,
Thank you for checking the test on a different system. It's good to see that this setup can work in combination with SLURM. I see from your output that I_MPI_FABRICS is set to shm:tcp, so it doesn't use the InfiniBand network. The MPI library fails on our system in DAPL, so maybe the problem is specific to InfiniBand? Unfortunately, I can't check whether TCP works on our system, since it's down for the coming two weeks.
Hello John,
Indeed, this simple code may not work over IB. If you need faster communication and would like to keep the same code example, you can set up IPoIB:
export I_MPI_FABRICS=shm:tcp
export I_MPI_TCP_NETMASK=ib
I just checked and it worked. You won't get the same speed as with native IB, but it should be faster than TCP over Ethernet.
If you want "real" IB, you may need to use MPI_PUBLISH_NAME/MPI_LOOKUP_NAME (hydra_nameserver needs to be started):
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
I did not try it myself, but I see some references:
http://mpi.deino.net/mpi_functions/MPI_Lookup_name.html
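I have not run this myself either, but to make the idea concrete, here is an untested sketch of how the two sides could use the name service instead of passing the port string by hand. The service name "openport_test" is arbitrary, and hydra_nameserver has to be running somewhere both jobs can reach (e.g. pointed to via mpiexec.hydra's -nameserver option):

```c
/* Untested sketch: publish/lookup the port name via the MPI name service.
 * Assumes hydra_nameserver is running and reachable by both jobs.
 * Run with "server" as the first argument for the accept side;
 * the client no longer needs the port string on the command line. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("openport_test", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        /* ... MPI_Allreduce etc. as in accept.c ... */
        MPI_Unpublish_name("openport_test", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        MPI_Lookup_name("openport_test", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        /* ... MPI_Allreduce etc. as in connect.c ... */
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```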
If you try it and succeed, I would be interested to see the final code. The problem with SLURM and a third-party PMI (which we do recommend in the SLURM case) is that it might introduce additional complications with hydra_nameserver, or it might not; this approach needs more experimentation.
Thanks,
Mark
Hello Mark,
Thanks for the suggestions. I didn't get this to work with srun, with or without hydra_nameserver. When I replace srun with mpiexec.hydra, it can connect to processes run in the same job or in another job. Here is a job script that connects MPI processes in two separate SLURM jobs:
#!/bin/bash
#
# This job script:
# -starts a process that opens a port and prints it
# -submits itself with the word 'connect' and the port as arguments
# -the second job opens the port
# -the processes communicate a bit and stop
#
#SBATCH -N 1
#SBATCH -n 1

export I_MPI_DEBUG=2
export I_MPI_PMI_LIBRARY=none

tmp=$(mktemp)
echo "output accept_c: $(readlink -f $tmp)"
if [ "$1" != "connect" ]; then
  sleep 5
  mpiexec.hydra -bootstrap srun -n 1 ./accept_c 2>&1 | tee $tmp &
  until [ "$port" != "" ]; do
    port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
    echo "Found port: $port"
    sleep 1
  done
  sbatch $0 connect $port
else
  sleep 3
  mpiexec.hydra -bootstrap srun -n 1 ./connect_c $2 &
fi
wait
This works fine over InfiniBand, which is good.
However, there are still some issues that I can't resolve after many tests:
- I have to set I_MPI_PMI_LIBRARY to a non-existent file, otherwise it starts multiple tasks that are each rank 0 of a separate MPI_COMM_WORLD. You can test this by increasing the number of tasks for connect_c in the above job (a minimal check is sketched after this list). I'm not sure why that happens (maybe this issue only exists on our cluster?).
- The tasks are distributed across nodes, but within a node all tasks are bound to core 0. This is of course not good, since our nodes each have 24 cores.
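For the first point, a trivial way to see what happened is to have each task print its rank and the size of MPI_COMM_WORLD. This is just an illustrative snippet, not part of the attached tests:

```c
/* Hypothetical diagnostic: print the rank/size each task sees.
 * If every task reports "rank 0 of 1", the tasks were started as
 * independent MPI_COMM_WORLDs instead of one job with several ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d in MPI_COMM_WORLD\n", rank, size);
    MPI_Finalize();
    return 0;
}
```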