Jack_S_2
Beginner

Problem with Intel MPI on >1023 processes

I have been testing code using Intel MPI (version 4.1.3, build 20140226) and the Intel compiler (version 15.0.1, build 20141023). When we attempt to run with 1024 or more total processes, we receive the following error:

MPI startup(): ofa fabric is not available and fallback fabric is not enabled 

Anything less than 1024 processes does not produce this error, and I also do not receive this error with 1024 processes using OpenMPI and GCC.

I am using the High Performance Conjugate Gradient benchmark as my test code, although we have received the same errors with other test codes. 

4 Replies
Artem_R_Intel1
Employee

Hi Jack,

Could you please provide more details about your MPI runs (IMPI environment variables, command line options, OS/OFED versions, processor type, InfiniBand adapter name, number of involved hosts and so on)?
Are you able to run with newer Intel MPI Library (5.x)?
 

Jack_S_2
Beginner

Artem,

Absolutely, thank you for the response.

I ran the tests with the following IMPI variables:

I_MPI_FABRICS=shm:ofa 

It was submitted through SLURM scheduling with the following batch script:

#!/bin/bash
#SBATCH --job-name=HPCGeval
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --time=02:00:00
#SBATCH --account=support
#SBATCH --nodes=64
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --constraint=hpcf2013

export KMP_AFFINITY=compact
export OMP_NUM_THREADS=1
srun  ../../xhpcg > /dev/null

OS: Red Hat release 6.6 (Santiago), OFED: OFED-1.5.4.1

All tests were run on 64 nodes, each with two Intel E5-2650v2 CPUs (16 cores per node) and a QLogic Corp. IBA7322 InfiniBand HCA (rev 02) card, connected to a QLogic 12800-180 switch.
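
For reference, the rank count implied by this configuration can be checked with a short sketch. It assumes the standard SLURM variables SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE are set inside the job; the fallback values 64 and 16 are simply the --nodes and --ntasks-per-node settings from the batch script above.

```shell
#!/bin/bash
# Sketch: compute the total number of MPI ranks the job will launch.
# Outside a SLURM job the variables are unset, so fall back to the
# values used in the batch script (--nodes=64, --ntasks-per-node=16).
nodes="${SLURM_JOB_NUM_NODES:-64}"
tasks_per_node="${SLURM_NTASKS_PER_NODE:-16}"

total=$(( nodes * tasks_per_node ))
echo "total MPI ranks: $total"

# The error in this thread was reported only at 1024 ranks or more.
if [ "$total" -ge 1024 ]; then
    echo "at or above the 1024-rank count where the error was seen"
fi
```

With 64 nodes and 16 tasks per node, the job sits exactly at the 1024-rank count where the error first appears.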

We rely on another company to handle our Intel MPI licenses and updates, although I believe that we will be upgrading to Intel MPI Library v5.x soon.

Best,

Jack

 

Artem_R_Intel1
Employee

Hi Jack,

Thanks for the clarification.
As far as I can see, you are using Intel True Scale (aka QLogic) IBAs; 'shm:ofa' may not work optimally on such adapters.
You can use the 'tmi' or 'shm:tmi' fabric, which is designed for such cases.

Usage:
export I_MPI_FABRICS=shm:tmi
export TMI_CONFIG=<path_to_impi>/intel64/etc/tmi.conf
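
A minimal sketch of how these two settings might sit in a batch script before the srun line. IMPI_ROOT is a hypothetical variable standing in for <path_to_impi>, and its default below is only a placeholder, not a real install location. The readability check is an added precaution, not something Intel MPI requires.

```shell
#!/bin/bash
# Sketch: select the TMI fabric before launching the MPI job.
# IMPI_ROOT is a hypothetical stand-in for <path_to_impi>; replace the
# placeholder default with the actual Intel MPI installation directory.
IMPI_ROOT="${IMPI_ROOT:-/path/to/impi}"

export I_MPI_FABRICS=shm:tmi
export TMI_CONFIG="${IMPI_ROOT}/intel64/etc/tmi.conf"

# Warn early if the TMI configuration file cannot be read, rather than
# discovering the problem at MPI startup.
if [ ! -r "$TMI_CONFIG" ]; then
    echo "warning: cannot read TMI_CONFIG at $TMI_CONFIG" >&2
fi

# Print the settings so they are recorded in the job output.
echo "I_MPI_FABRICS=$I_MPI_FABRICS"
echo "TMI_CONFIG=$TMI_CONFIG"
```

In the batch script from earlier in the thread, these exports would go alongside the existing KMP_AFFINITY and OMP_NUM_THREADS lines, ahead of srun.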
 

Jack_S_2
Beginner

Artem,

Thank you so much for your help. This solved the issue we were having, as well as another, separate issue!

I'm just curious, do you have any idea why this problem only seemed to surface after going over 1023 processes?

Best,

Jack
