Intel® MPI Library

IMPI and DAPL fabrics on Infiniband cluster

Marcos_V_1
New Contributor I

Hello, I have been trying to submit a job on our cluster for a code compiled with Intel 17 and Intel MPI. I keep running into trouble at startup when launching through PBS.

This is the submission script:

#!/bin/bash
#PBS -N propane_XO2_ramp_dx_p3125cm(IMPI)
#PBS -W umask=0022
#PBS -e /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.err
#PBS -o /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.log
#PBS -l nodes=16:ppn=12
#PBS -l walltime=999:0:0
module purge
module load null modules torque-maui intel/17
export OMP_NUM_THREADS=1
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=OpenIB-cma
export I_MPI_FALLBACK_DEVICE=0
export I_MPI_DEBUG=100
cd /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind
echo
echo $PBS_O_HOME
echo `date`
echo "Input file: propane_XO2_ramp_dx_p3125cm.fds"
echo " Directory: `pwd`"
echo "      Host: `hostname`"
/opt/intel17/compilers_and_libraries/linux/mpi/bin64/mpiexec   -np 184 /home4/mnv/FIREMODELS_ISSUES/fds/Build/impi_intel_linux_64/fds_impi_intel_linux_64 propane_XO2_ramp_dx_p3125cm.fds

As you can see, I'm selecting the DAPL fabric with OpenIB-cma as the DAPL provider. This is what I see in /etc/dat.conf on my login node:

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0" ""
OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0" ""
OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-ib2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib2 0" ""
ofa-v2-ib3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib3 0" ""
ofa-v2-bond u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "bond0 0" ""

Now, logging in to the actual compute nodes, I don't see an /etc/dat.conf on them. I don't know if this is normal or whether there is an issue there.
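In case it is useful, this is the kind of rough check I can run from inside a job to compare the nodes (just a sketch; it assumes passwordless ssh between nodes, the standard Torque $PBS_NODEFILE, and that the OFED ibstat tool is installed):

# Sketch: for every node in the allocation, report whether /etc/dat.conf
# exists and which IB adapters are present (qib vs mlx4).
for host in $(sort -u "$PBS_NODEFILE"); do
  echo "=== $host ==="
  ssh "$host" 'ls -l /etc/dat.conf 2>/dev/null || echo "no /etc/dat.conf"; ibstat -l 2>/dev/null'
done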

Anyway, when I submit the job I get the attached stdout file, where it seems some of the nodes fail to load OpenIB-cma (with no fallback fabric available).
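As a side note, my script sets I_MPI_FALLBACK_DEVICE=0, which is what disables fallback; if I read the documentation right, the fallback-enabled variant of those exports would look something like the following (I_MPI_FALLBACK being the newer name for the same control):

export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=OpenIB-cma
# allow Intel MPI to fall back to another available fabric if DAPL fails
export I_MPI_FALLBACK=1
export I_MPI_DEBUG=100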

To be clear, some nodes on the cluster use QLogic InfiniBand cards and others use Mellanox.

At this point I've tried several combinations, either specifying the IB fabrics or not, without success. I'd really appreciate any help troubleshooting this.

Thank you,

Marcos

Marcos_V_1
New Contributor I

An extra note:

You can see in the attached file that the MPI processes that fail to load OpenIB-cma are not tied to nodes with a particular qib0:0 or mlx4_0:0 NUMA map. See, for example, process [57] or [108].

Thank you,

Marcos

John_H_19
Beginner

Hi Marcos. I have run fire and smoke simulations quite a few times, most recently on an Omni-Path fabric, but that is another story. I would suggest getting whoever runs your cluster to set a node property in PBS so that you can choose nodes with all-Mellanox or all-QLogic cards. Also, can you run with I_MPI_FABRICS either left unset, or set to use ofa?
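Roughly what I have in mind, as a sketch (the property name "mlx" below is just a placeholder; your admin would define the real one in the Torque nodes file):

# request only nodes tagged with a given property, e.g. all Mellanox:
#PBS -l nodes=16:ppn=12:mlx

# then either leave I_MPI_FABRICS unset, or try the OFA (verbs) path:
export I_MPI_FABRICS=shm:ofa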

Marcos_V_1
New Contributor I

Hi John, thank you for your reply! Yes, we do have dedicated queues for QLogic (24 nodes, I think) and Mellanox (12 nodes, I think). We have been trying for some time to run large jobs that span more than one dedicated queue, and have been somewhat successful with Open MPI (though there have been other issues, like a constant memory leak we can't trace back to our source code).

I have noticed that Intel MPI (when it runs) is considerably faster than the Open MPI we have available, hence my attempts to span Intel MPI jobs across both sets of nodes.

I did try running the job using ofa instead of dapl, and also dapl with ofa-v2-ib0 selected from the configuration list above. The problem has always been that the calculation randomly times out at different communication steps. I also ran the case using tcp, and although extremely slow, it has run overnight without interruption.
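For reference, the variants I tried were along these lines:

# 1) ofa instead of dapl:
export I_MPI_FABRICS=shm:ofa
# 2) dapl, explicitly selecting the ofa-v2-ib0 provider from dat.conf:
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-ib0
# 3) tcp (over IPoIB/ethernet), extremely slow but it ran overnight without interruption:
export I_MPI_FABRICS=shm:tcp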

Best Regards,

Marcos
