Running Intel MPI with "I_MPI_FABRICS=ofa"

Nikhil_Mittal · ‎10-27-2011

Hello,

I started a thread on this topic on Oct 25th, 2011 but I am unable to see that thread now on this forum. I searched the forum but couldn't find it anywhere. I received two replies to my post in my email account as given below:

Reply 1:

Hi Nikhil, Well, this is a really specific case - it would be nice if you could explain why you cannot use shared memory. It may be that we need to fix that issue instead of the performance degradation with ofa. OFA fabric has its own settings and they were tuned

Reply 2:

Have you tried running the mpitune utility on your scenario? Although shared memory is the best choice for your setup, the tool may help you to identify MPI parameters that need to be modified as you are changing the usual assumptions regarding environment

My answers to these replies:

1. (reply 1): The two processes which are launched on the node (say control node) get dispatched to different nodes (say compute node) for actual execution (i.e. it's virtual execution on control node). There are two scenarios here:

(a) When both of the processes are dispatched to the same compute node, both "shm" or "ofa" work. But as I mentioned earlier, "ofa" runs quite slow.

(b) When these two processes are dispatched to two different compute nodes (one process to each compute node), only "ofa" works, and I get the same performance numbers as in (a) with "ofa".

2. (reply 2): I tried mpitune but it returns the error below:

27'Oct'11 18:37:33 WRN | Invalid default value ('/home/vertex/config.xml') of argument ('config-file').
27'Oct'11 18:37:33 CER | Invalid default value ('/home/vertex/options.xml') of argument ('options-file').
27'Oct'11 18:37:33 CER | A critical error has occurred!
Details:----------------------------------------------------------------------
Type : exceptions.Exception
Value : Invalid default value ('/home/vertex/options.xml') of argument ('options-file').

Why does it ask for a default file?

Thanks,
Nikhil
+++++++++++++++++++++++++++++++++++++++++++++++++++++

Original thread:

I am using Intel MPI version 4.0.2.003 on a CentOS 5.6 64-bit platform. I am running the IMB-MPI1 (Pallas) benchmark on this platform. I have set I_MPI_FABRICS=ofa (in other words, I need to force the use of OFED for communication between MPI processes).

Problem description:

-----------------------

When I run as: "mpiexec -n 2 IMB-MPI1" it launches two processes on the node.

For a reason specific to my environment, I cannot use shared memory for I_MPI_FABRICS. Though the IMB-MPI1 benchmark suite runs fine, the performance numbers are almost 40% lower than when I run the same suite using OpenMPI (without shared memory). Of course, when I use I_MPI_FABRICS=shm with the processes executing on the same node, I get very high performance numbers.

My question is: Is there a "loopback" mode in Intel MPI which I can try for the processes running on the same node? Or is there any specific tuning parameter that I can use?

Thanks,

Nikhil

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Dmitry_K_Intel2 · ‎10-28-2011

Hi Nikhil,

Well, from your reply to the first question I don't understand why you cannot use a setting like I_MPI_FABRICS=shm:ofa. The shm fabric will be used if both processes are on the same node, and the ofa fabric will be used if the processes run on different nodes.
Or can you somehow relocate an existing process to a remote node?
Maybe your dispatcher can do something like:
    if running_on_one_node:
        mpirun -env I_MPI_FABRICS shm ...
    else:
        mpirun -env I_MPI_FABRICS ofa ...
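
A minimal shell sketch of the dispatcher logic suggested above. The function name `pick_fabric` and the host names are hypothetical, and the script only prints the `mpirun` command it would run rather than launching anything. It assumes the dispatcher knows the list of target hosts before the job starts.

```shell
#!/bin/sh
# pick_fabric: hypothetical helper, not part of Intel MPI.
# Prints "shm" when every host argument is identical, "ofa" otherwise.
pick_fabric() {
    first=$1
    for host in "$@"; do
        if [ "$host" != "$first" ]; then
            echo ofa
            return 0
        fi
    done
    echo shm
}

# Example host names are placeholders for whatever the dispatcher chose.
fabric=$(pick_fabric node01 node02)

# Print the command instead of executing it.
echo "mpirun -env I_MPI_FABRICS $fabric -n 2 IMB-MPI1"
```

If the dispatcher can move a process after launch, this up-front choice no longer holds and a combined setting such as shm:ofa is the safer option.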

From the log I don't understand why you get the error like:
"CER | Invalid default value ('/home/vertex/options.xml')"

File options.xml should be located in the //options.xml
Could you send the command line used to start mpitune? There may have been a syntax error.

Regards!
Dmitry

Nikhil_Mittal · ‎10-31-2011

Hi Dmitry,
I can use "I_MPI_FABRICS=shm:ofa" and it runs as expected i.e. uses shared memory for the processes running on same node and IB for processes between nodes. But again, the performance numbers are very low for the processes running across nodes using IB.

FYI: In my specific environment, no mpd daemons are required to be running on remote nodes; there is only one mpd daemon that is running on the node from where the MPI processes are launched.

I ran "mpitune" on the local node where it keeps on running for a long time. This time I didn't get the error (don't know what is different now than before).
Thanks,
Nikhil

Dmitry_K_Intel2 · ‎11-01-2011

Hi Nikhil,

To be honest I don't understand how your applications will work without mpd services. MPI applications require mpd daemons (or pmi_proxy in case of Hydra process manager) to get/send service information between mpi processes. Are you using your own process manager?

Regarding performance, we need to know how you measure it. What benchmark do you use?
Could you please set I_MPI_DEBUG=5 and check that the ofa fabric was selected in the case of I_MPI_FABRICS=ofa for the run on 2 nodes? Intel MPI Library may use its "fallback" mechanism and switch to tcp if ofa is not available, in which case you'll get very bad performance.
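
One way to run the check suggested above is to capture the I_MPI_DEBUG=5 output and look at the startup line that names the data transfer mode. This is a sketch under assumptions: the helper `used_tcp` is hypothetical, and the exact wording of the startup line (assumed here to contain the phrase "data transfer mode") may differ between Intel MPI versions, so adjust the pattern to what your run actually prints.

```shell
#!/bin/sh
# used_tcp: hypothetical helper, not part of Intel MPI.
# Reads captured debug output on stdin and succeeds (exit 0) if a
# "data transfer mode" line mentions tcp, which suggests a fallback.
# ASSUMPTION: the startup line contains the phrase "data transfer mode".
used_tcp() {
    grep 'data transfer mode' | grep -qw 'tcp'
}

# Typical use (not executed here):
#   mpirun -env I_MPI_FABRICS ofa -env I_MPI_DEBUG 5 -n 2 IMB-MPI1 > run.log 2>&1
#   if used_tcp < run.log; then echo "fell back to tcp"; fi
```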

Well, it seems to me that you need to share more information with me about your specific case - I'm afraid you are doing something wrong, or maybe you expect something that you shouldn't.

About mpitune:
There are 2 modes: cluster-specific mode and application-specific mode.
If you run cluster-specific mode you need to use the 'mpirun -tune' command line.
If you run application-specific mode you need to use 'mpirun -tune generated_conf_file.cfg'.
Note that I'm talking about version 4.0.x of the Intel MPI Library - the syntax of mpitune for 3.2.x may be different.
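
The two invocations described above, written out as a small sketch. It only prints the command lines. The process count and benchmark name are taken from earlier in the thread, the function name `tune_cmd` is hypothetical, and `generated_conf_file.cfg` is the placeholder file name used above.

```shell
#!/bin/sh
# tune_cmd: hypothetical helper that prints the mpirun line for each mode.
#   tune_cmd cluster            -> cluster-specific mode
#   tune_cmd app <config file>  -> application-specific mode
tune_cmd() {
    case "$1" in
        cluster) echo "mpirun -tune -n 2 IMB-MPI1" ;;
        app)     echo "mpirun -tune $2 -n 2 IMB-MPI1" ;;
        *)       echo "usage: tune_cmd cluster | app <config file>" >&2
                 return 1 ;;
    esac
}

tune_cmd cluster
tune_cmd app generated_conf_file.cfg
```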

Regards!
Dmitry