I have been testing code built with Intel MPI (version 4.1.3 build 20140226) and the Intel compiler (version 15.0.1 build 20141023). When we attempt to run on 1024 or more total processes, we receive the following error:
MPI startup(): ofa fabric is not available and fallback fabric is not enabled
Runs with fewer than 1024 processes do not produce this error, and I also do not see it with 1024 processes when using OpenMPI and GCC.
I am using the High Performance Conjugate Gradient benchmark as my test code, although we have received the same errors with other test codes.
Could you please provide more details about your MPI runs (IMPI environment variables, command line options, OS/OFED versions, processor type, InfiniBand adapter name, number of involved hosts and so on)?
Are you able to run with newer Intel MPI Library (5.x)?
Absolutely, thank you for the response.
I ran the tests with the following IMPI variables:
It was submitted through SLURM scheduling with the following batch script:
srun ../../xhpcg > /dev/null
OS: Red Hat release 6.6 (Santiago), OFED: OFED-18.104.22.168
All tests were run on 64 nodes in total, each with two Intel E5-2650v2 CPUs (16 cores per node) and a QLogic Corp. IBA7322 InfiniBand HCA (rev 02), connected to a QLogic 12800-180 switch.
We rely on another company to handle our licenses and updates with Intel-MPI, although I believe that we will be upgrading to Intel MPI Library v5.x soon.
Thanks for the clarification.
As far as I can see, you are using Intel True Scale (aka QLogic) adapters, and 'shm:ofa' may perform suboptimally on such adapters.
You can use the 'tmi' or 'shm:tmi' fabric instead, which is designed for such cases.
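For example, the fabric could be selected in the batch script before the launch line. This is only a minimal sketch: it assumes the fabric is chosen through the Intel MPI I_MPI_FABRICS environment variable and that the job is started with the same srun line as before. The I_MPI_DEBUG setting is an optional addition, not something from your script, and should make the library report the selected data transfer modes at startup.

```shell
# Use shared memory within a node and TMI between nodes.
export I_MPI_FABRICS=shm:tmi

# Optional (assumption): a low debug level so that MPI startup reports
# which fabric was actually selected.
export I_MPI_DEBUG=2

# Show the setting that the launched processes will inherit.
echo "I_MPI_FABRICS=$I_MPI_FABRICS"

# Then launch as before, e.g. with the line from the batch script:
# srun ../../xhpcg > /dev/null
```

If the startup output still mentions ofa, the variable is probably not reaching the launched processes.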
Thank you so much for your help. This solved the issue, as well as another problem we were having!
I'm just curious, do you have any idea why this problem only seemed to surface after going over 1023 processes?