I've read through a couple of other threads with a similar issue, but I'm still not able to get this thing running. It may be because I have several IB interfaces and I haven't found an option for specifying which one to use.
In a nutshell, I've compiled and am trying to run Intel's MP_LINPACK.
The run starts, and exits with the retries exhausted (as shown here):

This run was done on: Thu May 19 14:18:55 MDT 2011
MPI startup(): RDMA data transfer mode
MPI startup(): RDMA data transfer mode
MPI startup(): RDMA data transfer mode
MPI startup(): RDMA data transfer mode
MPI startup(): DAPL provider OpenIB-cma
MPI startup(): dapl data transfer mode
MPI startup(): DAPL provider OpenIB-cma
MPI startup(): dapl data transfer mode
MPI startup(): DAPL provider OpenIB-cma
MPI startup(): dapl data transfer mode
MPI startup(): DAPL provider OpenIB-cma
MPI startup(): dapl data transfer mode
MPI startup(): static connections storm algo
pg-v2:11063: dapl_cma_active: ARP_ERR, retries(15) exhausted -> DST 10.155.90.13,4866
pg-v2:11063: dapl_cma_active: ARP_ERR, retries(15) exhausted -> DST 10.155.90.12,4755
[0:pg73-v2] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 4226: 0
internal ABORT - process 0
rank 0 in job 8 pg73-v2_55194 caused collective abort of all ranks
exit status of rank 0: return code 1
Done: Thu May 19 14:19:56 MDT 2011
This would tell me, I believe, that I cannot reach the other nodes. I have the following environment variables set:

I_MPI_DEBUG=5
I_MPI_DEVICE=rdma:OpenIB-cma
I_MPI_FABRICS_LIST=ofa,dapl
I_MPI_MPD_RSH=ssh
I_MPI_ROOT=/opt/intel/impi/4.0.1.007
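One thing I've been meaning to try is dropping the old-style I_MPI_DEVICE and selecting the fabric only with the newer 4.0-style variables; something like the following is what I had in mind, though the provider name is just the default OpenIB-cma entry from my dat.conf, so treat the exact values as guesses on my part:

# tentative cleanup: pick the fabric the 4.0 way instead of via I_MPI_DEVICE
unset I_MPI_DEVICE
export I_MPI_FABRICS=shm:dapl           # or shm:ofa to go through verbs directly
export I_MPI_DAPL_PROVIDER=OpenIB-cma   # must match an entry in /etc/dat.conf
export I_MPI_DEBUG=5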
I'm actually trying to use ib3 rather than ib0 (ib0 is down at the moment), so I wonder if that has something to do with the issue?
Here is the ibstat info for the ib3 port:

CA 'mlx4_3'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.8.600
        Hardware version: a0
        Node GUID: 0x0002c903000bb0dc
        System image GUID: 0x0002c903000bb0df
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 24
                LMC: 0
                SM lid: 2
                Capability mask: 0x02510868
                Port GUID: 0x0002c903000bb0dd
                Link layer: IB
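Since OpenIB-cma resolves peers through the IP interface as far as I understand (which I assume is where the ARP_ERR messages come from), I was also going to check whether dat.conf on these nodes has per-HCA entries I could point at instead. Roughly this is what I had in mind; mlx4_3 is just the HCA name from the ibstat output above:

# list the DAPL providers configured on this node...
grep -v '^#' /etc/dat.conf
# ...and see whether any entry is tied to the mlx4_3 HCA specifically
grep mlx4_3 /etc/dat.conf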
I'm at MLNX_OFED_LINUX-1.5.2-2.1.0 (OFED-1.5.2-20101219-1546).

I'm using, basically, the default runme_intel64 script with a couple of minor changes:
#!/bin/bash
#
echo "This is a SAMPLE run script. Change it to reflect the correct number"
echo "of CPUs/threads, number of nodes, MPI processes per node, etc.."
#
#
# You can find description of all Intel MPI parameters in the
# Intel MPI Reference Manual.
# See /doc/Reference_manual.pdf
#
export I_MPI_EAGER_THRESHOLD=128000
# This setting may give 1-2% of performance increase over the
# default value of 262000 for large problems and high number of cores
cp HPL_serial.dat HPL.dat
echo -n "This run was done on: " date
# Capture some meaningful data for future reference: echo -n "This run was done on: " >> $OUT date >> $OUT echo "HPL.dat: " >> $OUT cat HPL.dat >> $OUT echo "Binary name: " >> $OUT ls -l xhpl_intel64 >> $OUT echo "This script: " >> $OUT cat runme_intel64 >> $OUT echo "Environment variables: " >> $OUT env >> $OUT echo "Actual run: " >> $OUT
# Environment variables can also be set on the Intel MPI command
# line using the -genv option:
#
mpiexec -np 4 ./xhpl_intel64 | tee -a xhpl_intel64_outputs.txt

# In case of multiple nodes involved, please set the number of MPI processes
# per node (ppn=1,2 typically) through the -perhost option (because the
# default is all cores):
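For what it's worth, when I move to multiple nodes I was planning to invoke it roughly like this; the -perhost/-np values are just what I'd use for two nodes with one process each, and forcing the fabric via -genv is my own guess at a workaround rather than something from the stock script:

# hypothetical multi-node run: 2 nodes, 1 MPI process per node,
# with the fabric forced on the command line instead of in the environment
mpiexec -perhost 1 -np 2 -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm:dapl \
        ./xhpl_intel64 | tee -a xhpl_intel64_outputs.txt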
Any thoughts? I've regularly run MPI programs on this cluster with openmpi-1.4.2, but things are usually a little different there. I tend not to use my head node as one of the compute nodes, but this Intel library seems to like to use it as rank 0 :( That's a minor thing though; I suspect there is a way around it.
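On that point, the workaround I had in mind was to keep the head node out of the machinefile given to mpiexec, so the ranks land on the compute nodes even if the mpd ring was started from the head node. The hostnames below are made up and this is only a sketch of what I intend to try, not something I've confirmed works:

# hosts.linpack: compute nodes only, head node omitted, one "host:processes" entry per line, e.g.
#   pg73-v2:1
#   pg74-v2:1
mpiexec -machinefile hosts.linpack -np 2 ./xhpl_intel64 | tee -a xhpl_intel64_outputs.txt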
Any clues on the above and why I can't connect? Is there an environment variable to indicate which ibX card to use?
Just a quick follow-up: I am able to run across most of my nodes at this point, except for those that have a card other than ib0 as their primary interface. If this is, in fact, simply a matter of mpiexec and mpd defaulting to ib0, is there an easy way to tell them, on those nodes, to use a different IB interface?
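If it helps to know what I'm fishing for: a per-job override along these lines is what I was hoping exists. The OFA variable names are what I found skimming the Intel MPI reference manual, but I haven't verified them against 4.0.1.007, and mlx4_3 is just the HCA name from ibstat on the nodes in question:

# guesswork: use the OFA fabric and point it at the mlx4_3 HCA instead of the default one
export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_NUM_ADAPTERS=1
export I_MPI_OFA_ADAPTER_NAME=mlx4_3   # the HCA behind ib3 per ibstat, not the ib3 IP interface itself
mpiexec -perhost 1 -np 4 ./xhpl_intel64 | tee -a xhpl_intel64_outputs.txt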