Intel® MPI Library

Dual-rail MPI binding

Filippo_Spiga

Dear experts,

I am looking for confirmation that I am doing this properly. Here is my situation. The new cluster at my institution has two Mellanox Connect-IB cards in each node. Each node is a dual-socket, six-cores-per-socket Ivy Bridge system. The node architecture is such that each socket has a direct PCIe link to one of the IB cards. What I want to do is assign a subset of the MPI processes (e.g. the first 6) to the first IB card and the remaining MPI processes to the second IB card. No rail sharing: for both small and large messages, each MPI process should use its single assigned IB card.
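As a sanity check on the topology (assuming the standard sysfs layout exposed by the mlx5 driver; lstopo from hwloc shows the same information), the NUMA node each card is attached to can be read directly on a compute node:

# Print the NUMA node (socket) each Connect-IB card is attached to
for hca in mlx5_0 mlx5_1; do
  echo "${hca}: NUMA node $(cat /sys/class/infiniband/${hca}/device/numa_node)"
done

If mlx5_0 reports node 0 and mlx5_1 reports node 1, the intended mapping of ranks to cards matches the hardware.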

Here is what I did...

export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_NUM_ADAPTERS=2
export I_MPI_OFA_ADAPTER_NAME=mlx5_0,mlx5_1
export I_MPI_OFA_RAIL_SCHEDULER=PROCESS_BIND
export I_MPI_PIN_DOMAIN=core
export I_MPI_PIN_ORDER=scatter
export I_MPI_DEBUG=6

mpirun -genvall -print-rank-map -np 24 -ppn 12 ./run_dual_bind <exe>

The "run_dual_bind" script contains...

#!/bin/bash
# PMI_RANK is set by the Hydra process manager; 12 ranks per node.
lrank=$(($PMI_RANK % 12))

case ${lrank} in
  0|1|2|3|4|5)
    # Local ranks 0-5: use the first IB card (mlx5_0)
    export CUDA_VISIBLE_DEVICES=0
    export I_MPI_OFA_NUM_ADAPTERS=1
    export I_MPI_OFA_ADAPTER_NAME=mlx5_0
    "$@"
    ;;
  6|7|8|9|10|11)
    # Local ranks 6-11: use the second IB card (mlx5_1)
    export I_MPI_OFA_NUM_ADAPTERS=1
    export I_MPI_OFA_ADAPTER_NAME=mlx5_1
    "$@"
    ;;
esac

In theory it should work. I can verify the MPI binding by looking at the mpirun output, but I have no idea whether the interconnect I want each rank to use is really the one being used.
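One way I could think of to verify this (assuming the usual InfiniBand port counters are exposed under sysfs) is to snapshot the traffic counters of both adapters on a node before and after a run and see which rail actually moved data:

# Dump port_xmit_data (in 4-octet units) for both adapters;
# run before and after the job and compare the deltas.
# Assumes port 1 on each adapter; adjust if the active port differs.
for hca in mlx5_0 mlx5_1; do
  c=/sys/class/infiniband/${hca}/ports/1/counters/port_xmit_data
  echo "${hca}: $(cat ${c})"
done

If the binding works as intended, both counters should grow on every node, each roughly in proportion to the traffic of the six ranks assigned to that card.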

Am I doing this properly? Is this the right way to achieve fine-grained rail binding?

Many thanks in advance. I also take the opportunity to wish everybody a Happy New Year!

Filippo

 
