I am looking to run Intel MPI over an RDMA-capable fabric. Since I am still acquiring Intel MPI, I was wondering how this setup usually runs.
1. Does Intel MPI run with one process per CPU (or hyperthread), or with one thread per CPU (or hyperthread)?
2. How many queue pairs does each machine consume? For example, if I have M machines in the cluster and each machine has N cores, does each machine need
(a) M - 1 queue pairs (one QP per machine to talk to the other M - 1 machines in the cluster),
(b) N * (M - 1) queue pairs (one QP per local core to talk to each of the M - 1 other machines), or
(c) N * N * (M - 1) queue pairs (one QP per local core to talk to each of the N * (M - 1) remote cores)?
3. By default, does Intel MPI use UD mode or RC mode?
Answers from our expert:
1. Every MPI rank is a process, which runs on a physical or logical core depending on the machine configuration and environment variable settings (e.g. I_MPI_PIN_PROCESSOR_LIST). By setting I_MPI_DEBUG=5 and running any MPI application of interest, you can see exactly where the ranks are pinned (or free to run) from the extra debug output printed to the screen.
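As a minimal sketch of that debug workflow (the host names and the binary `./my_app` are placeholders, not from the original answer; I_MPI_DEBUG and I_MPI_PIN_PROCESSOR_LIST are the documented Intel MPI variables):

```shell
# Ask Intel MPI to print rank placement/pinning details at startup.
export I_MPI_DEBUG=5

# Optionally pin ranks to an explicit list of logical cores:
# export I_MPI_PIN_PROCESSOR_LIST=0-15

# Launch any MPI application; the pinning map appears in the debug output.
mpirun -n 32 -hosts node01,node02 ./my_app
```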
2. The number of queue pairs depends on the number of ranks. Assuming a fully subscribed scenario (one rank per core), we can speak in terms of core counts as well. The count also depends on the connection type: Reliable Connection (RC) or Unreliable Datagram (UD). For RC, the number of QPs is N * N * M; for UD it is N * M.
Queue pairs are a low-level concept; users of the Intel MPI Library do not need to configure them at the library level.
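To make the scaling difference concrete, here is a small sketch that evaluates the expert's two formulas (the function name and the example cluster size of 16 machines with 32 cores are illustrative assumptions):

```python
def qp_count(machines: int, cores_per_machine: int, mode: str) -> int:
    """Approximate cluster-wide queue-pair count, per the formulas above.

    machines          -- M, number of machines in the cluster
    cores_per_machine -- N, cores per machine (fully subscribed: one rank each)
    mode              -- "RC" (connection per rank pair) or "UD" (connectionless)
    """
    if mode == "RC":
        # RC: roughly one QP per (local rank, remote rank) pair -> N * N * M
        return cores_per_machine * cores_per_machine * machines
    if mode == "UD":
        # UD: connectionless, roughly one QP per rank -> N * M
        return cores_per_machine * machines
    raise ValueError(f"unknown mode: {mode}")

# Example: 16 machines, 32 cores each
print(qp_count(16, 32, "RC"))  # 32 * 32 * 16 = 16384
print(qp_count(16, 32, "UD"))  # 32 * 16 = 512
```

The quadratic growth of RC is why large-scale runs tend to switch to UD, as answer 3 notes.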
3. The selection logic for UD or RC depends on the number of ranks, the number of nodes, and the fabric provider being used. Generally, Intel MPI selects RC for small-scale runs and UD for larger-scale runs.