Intel® MPI Library

Hybrid MPI/OpenMP process pinning

Joel1
Beginner
I have a large SMP system on which I am trying to run a hybrid MPI/OpenMP code, and I am looking for information on correct process placement on this system when using Intel MPI. Using I_MPI_PIN_DOMAIN=socket and KMP_AFFINITY=compact gives the expected results, with each rank (and all of its threads) running on a single socket. But this setup always includes the first socket in the system, which does not work here: this is a multi-user system, and many people run on it at the same time.
The logical next step appears to be using CPU masks with I_MPI_PIN_DOMAIN. I would have expected I_MPI_PIN_DOMAIN=[3F000,FC0000,3F000000,FC0000000] to create domains on cores 12-17, 18-23, 24-29, and 30-35. But one of these domains always ends up being a catch-all for the cores that were not specified, which again leads to a rank being pinned on the first socket.
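For reference, these hexadecimal masks can be generated rather than written by hand; below is a minimal Python sketch (the core ranges are the ones from this post, nothing required by Intel MPI itself):

# Minimal sketch: build the hexadecimal masks for I_MPI_PIN_DOMAIN=[m1,m2,...]
# from inclusive (first_core, last_core) ranges.
def core_mask(first, last):
    # set bits first..last (inclusive) and format as hex
    return format(sum(1 << core for core in range(first, last + 1)), "X")

ranges = [(12, 17), (18, 23), (24, 29), (30, 35)]
print("I_MPI_PIN_DOMAIN=[" + ",".join(core_mask(a, b) for a, b in ranges) + "]")
# prints: I_MPI_PIN_DOMAIN=[3F000,FC0000,3F000000,FC0000000]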
I have also tried other methods such as numactl, but Intel MPI does not appear to respect placement set by such tools.
As an example, some debugging output with I_MPI_PIN_DOMAIN=[3F000,FC0000,3F000000,FC0000000] is shown below. The Intel MPI Library version is 4.0 Update 2.
[0] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[0] Rank Pid Node name Pin cpu
[0] 0 8231 host.domain {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239}
[0] 1 8229 host.domain {24,25,26,27,28,29}
[0] 2 8230 host.domain {30,31,32,33,34,35}
[0] 3 8232 host.domain {36,37,38,39,40,41}
[0] MPI startup(): I_MPI_ADJUST_BCAST=3
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:tcp
[0] MPI startup(): I_MPI_SHM_BUFFER_SIZE=131072
[0] MPI startup(): MPICH_INTERFACE_HOSTNAME=192.168.1.1
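As a side note, the affinity each rank actually receives can also be checked from inside the job, independently of the I_MPI_DEBUG table above. Below is a rough sketch, assuming Linux and that mpi4py is available, neither of which is part of the setup described here:

# Rough cross-check of per-rank pinning, independent of I_MPI_DEBUG.
# Run it under mpirun with the same I_MPI_PIN_* settings as the real application.
import os
from mpi4py import MPI  # assumption: mpi4py built against Intel MPI

rank = MPI.COMM_WORLD.Get_rank()
cpus = sorted(os.sched_getaffinity(0))  # CPUs this process is allowed to run on
print("rank %d pinned to cpus %s" % (rank, cpus))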
Are there any suggestions on making process placement work for hybrid MPI/OpenMP where the first socket is never used?
Thanks...
Joel
6 Replies
Dmitry_K_Intel2
Employee
Hi Joel,

The mask you used is correct, but there was an issue in the library that prevented the correct pinning from being set.
Could you please download version 4.0 Update 3 of the Intel MPI Library and give it a try?

Regards!
Dmitry
Joel1
Beginner
Dmitry,
That fixes it. Thanks.
Joel
Joel1
Beginner
It appears that I am still running into a pinning problem when using masks, but the problem is different from before.
I have a set of simulations that I am trying to execute, all using the same pinning variables:
I_MPI_PIN=on
I_MPI_PIN_MODE=mpd
I_MPI_PIN_DOMAIN=[3F(...)000,FC0(...)000,3F(...)000000,FC0(...)000000,etc]
where the masks are associated with cores 144-149, 150-155, (...), 228-233, and 234-239 (i.e., 16 ranks placed on sockets 24-39 of a 6-core-per-socket system).
The system on which these simulations run is idle except for my simulations. In 8 or 9 out of 10 runs, Intel MPI does the placement correctly. But in the other cases (again, with identical I_MPI_PIN variable settings) it oversubscribes a set of sockets: it places ranks 0-7 AND ranks 8-15 on sockets 24-31 and ignores sockets 32-39.
I have not yet been able to make this a repeatable error, other than by running the software over and over until it occurs. My log file got wiped out, so I do not have the output from I_MPI_DEBUG, and right now the system is placing processes correctly. If I get a case that fails, I will attach the output.
Joel
Joel1
Beginner
We found I_MPI_DEBUG output from a case that failed to respect the mask settings. In this instance ranks 0, 3, 6, and 9 were placed on socket 28; ranks 1, 4, 7, and 10 were placed on socket 29; and ranks 2, 5, 8, and 11 were placed on socket 30. The job was then run again (with identical settings) to see whether this would be reproducible, but the system placed the processes correctly this time.
I_MPI_PIN=on
I_MPI_PIN_MODE=mpd
I_MPI_PIN_DOMAIN=[
3F000000000000000000000000000000000000000000
FC0000000000000000000000000000000000000000000
3F000000000000000000000000000000000000000000000
FC0000000000000000000000000000000000000000000000
3F000000000000000000000000000000000000000000000000
FC0000000000000000000000000000000000000000000000000
3F000000000000000000000000000000000000000000000000000
FC0000000000000000000000000000000000000000000000000000
3F000000000000000000000000000000000000000000000000000000
FC0000000000000000000000000000000000000000000000000000000
3F000000000000000000000000000000000000000000000000000000000
FC0000000000000000000000000000000000000000000000000000000000
]
...
[7] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[11] MPI startup(): shm and tcp data transfer modes
[8] MPI startup(): shm and tcp data transfer modes
[5] MPI startup(): shm and tcp data transfer modes
[4] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[6] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[10] MPI startup(): shm and tcp data transfer modes
[9] MPI startup(): shm and tcp data transfer modes
[0] Rank Pid Node name Pin cpu
[0] 0 19930 local.domain {168,169,170,171,172,173}
[0] 1 19919 local.domain {174,175,176,177,178,179}
[0] 2 19921 local.domain {180,181,182,183,184,185}
[0] 3 19920 local.domain {168,169,170,171,172,173}
[0] 4 19922 local.domain {174,175,176,177,178,179}
[0] 5 19923 local.domain {180,181,182,183,184,185}
[0] 6 19924 local.domain {168,169,170,171,172,173}
[0] 7 19925 local.domain {174,175,176,177,178,179}
[0] 8 19926 local.domain {180,181,182,183,184,185}
[0] 9 19927 local.domain {168,169,170,171,172,173}
[0] 10 19929 local.domain {174,175,176,177,178,179}
[0] 11 19928 local.domain {180,181,182,183,184,185}
[0] MPI startup(): I_MPI_ADJUST_BCAST=3
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:tcp
[0] MPI startup(): I_MPI_SHM_BUFFER_SIZE=131072
[0] MPI startup(): MPICH_INTERFACE_HOSTNAME=192.168.1.1
Dmitry_K_Intel2
Employee
Hi Joel,

Right now the mask for pinning is limited to 64 bits only, so this might be the reason for the unstable behavior.
Please avoid using I_MPI_PIN_MODE; it is not needed in your case.
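To illustrate the limitation, a quick Python check (using the core ranges from your post) shows how far these masks go beyond 64 bits:

# Quick illustration of the 64-bit limitation mentioned above:
# a mask covering cores 144-149 already needs 150 bits.
mask_144_149 = sum(1 << core for core in range(144, 150))
mask_234_239 = sum(1 << core for core in range(234, 240))
print(hex(mask_144_149), mask_144_149.bit_length())  # 150 bits
print(hex(mask_234_239), mask_234_239.bit_length())  # 240 bits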

Could you please provide the output of the cpuinfo utility (shipped with Intel MPI) and the command line used to run the application? Please also list any environment variables that may affect execution.

Regards!
Dmitry
Dmitry_S_Intel
Moderator

Hi,

Please try Intel(R) MPI Library 4.1.0.030.

You can find it at https://registrationcenter.intel.com/RegCenter/Download.aspx?productid=1626

--

Dmitry
