Scatter the processes across sockets

ZQyouOSC · 06-26-2024

Hello,

I am researching how to pin processes across sockets using the I_MPI_PIN_DOMAIN and I_MPI_PIN_ORDER environment variables. I have tried many combinations, but none of them have worked. I used Intel MPI 2021.12.1 and ran TACC amask on two systems: one equipped with two Xeon Platinum 8470 CPUs and the other equipped with two Xeon CPU Max 9470 CPUs.

With the default settings, I ran the following command:

mpiexec -n 4 amask_mpi

and, as expected, the processes were placed in bunch order:

     Each row of matrix is a mask for a Hardware Thread (hwt).                                                                                                                            
     CORE ID  = matrix digit + column group # in |...|                                                                                                                                       
     A set mask bit (proc-id) = core id + add 104 to each additional row.                                                                                                                    
                                                                                                                                                                                          
rank |    0    |   10    |   20    |   30    |   40    |   50    |   60    |   70    |   80    |   90    |   100   |                                                        
0000 0---4---8---2---6---0---4---8---2---6---0---4---8---2---6---0---4---8---2---6---0---4---8---2-----------                                                                         
0001 --2---6---0---4---8---2---6---0---4---8---2---6---0---4---8---2---6---0---4---8---2---6---0---4---------                                                                   
0002 -1---5---9---3---7---1---5---9---3---7---1---5---9---3---7---1---5---9---3---7---1---5---9---3----------                                                                   
0003 ---3---7---1---5---9---3---7---1---5---9---3---7---1---5---9---3---7---1---5---9---3---7---1---5--------

The first two processes are bound to the first and second NUMA nodes in the first socket, and the remaining two processes are bound to the first and second NUMA nodes in the second socket.
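
For readers unfamiliar with amask output, the header rules above can be sketched in Python. This is only an illustration and not part of amask: the function name parse_amask_line is hypothetical. It assumes each row is a rank label followed by a mask in which column i stands for core i and shows the last digit of i when that core is set. It handles a single row per rank and ignores the additional hardware-thread rows (the "add 104" rule).

```python
def parse_amask_line(line: str) -> tuple[int, list[int]]:
    """Split one amask row into (rank, list of core IDs set in the mask).

    Hypothetical helper, assuming the layout shown in the post:
    a rank label, whitespace, then one character per core.
    """
    rank_label, mask = line.rstrip().split(maxsplit=1)
    cores = []
    for col, ch in enumerate(mask):
        if ch == "-":
            continue  # core `col` is not in this rank's mask
        # A set bit is printed as the last digit of the core ID,
        # so the column index itself is the full core ID.
        if not ch.isdigit() or int(ch) != col % 10:
            raise ValueError(f"unexpected character {ch!r} at column {col}")
        cores.append(col)
    return int(rank_label), cores


if __name__ == "__main__":
    # Shortened rows in the style of the outputs in the post.
    print(parse_amask_line("0000 0---4---8---2---6---"))
    # A single set bit at column 16, printed as its last digit "6".
    print(parse_amask_line("0002 " + "-" * 16 + "6" + "-" * 7))
```

Applied to the default run, rank 0 decodes to cores 0, 4, 8, 12, ... and rank 1 to cores 2, 6, 10, ..., which is how the placement described above is read off the matrix.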

I can also achieve the process placement in compact order with the following variables:

I_MPI_PIN_DOMAIN=core
I_MPI_PIN_ORDER=compact

and each process was pinned to a single core, all in the first NUMA node:

     Each row of matrix is a mask for a Hardware Thread (hwt).                                                                                                                                
     CORE ID  = matrix digit + column group # in |...|                                                                                                                                        
     A set mask bit (proc-id) = core id + add 104 to each additional row.                                                                                                                     
                                                                                                                                                                                             
rank |    0    |   10    |   20    |   30    |   40    |   50    |   60    |   70    |   80    |   90    |   100   |                                                                          
0000 0-------------------------------------------------------------------------------------------------------                                                                                 
0001 --------8-----------------------------------------------------------------------------------------------                                                                                 
0002 ----------------6---------------------------------------------------------------------------------------                                                                                 
0003 ------------------------4-------------------------------------------------------------------------------

I have been experimenting with different values of I_MPI_PIN_DOMAIN and I_MPI_PIN_ORDER to achieve process placement in a scatter order across sockets. For example, I want the first and third processes to be bound to the first socket, and the second and fourth processes to be bound to the second socket. However, I have not had any success in finding a working combination. Could you please provide any suggestions? Thank you.
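
The difference between the current and the desired placement can be written as a rank-to-socket mapping. The sketch below only illustrates the target layout. It is not an Intel MPI setting, and the function names bunch_socket and scatter_socket are made up for this example.

```python
def bunch_socket(rank: int, n_ranks: int, n_sockets: int = 2) -> int:
    """Bunch order: consecutive ranks fill one socket before moving to the next."""
    ranks_per_socket = -(-n_ranks // n_sockets)  # ceiling division
    return rank // ranks_per_socket


def scatter_socket(rank: int, n_sockets: int = 2) -> int:
    """Scatter order: consecutive ranks alternate between sockets (round-robin)."""
    return rank % n_sockets


if __name__ == "__main__":
    n_ranks = 4
    # What the default run produces: ranks 0,1 on socket 0 and ranks 2,3 on socket 1.
    print([bunch_socket(r, n_ranks) for r in range(n_ranks)])  # [0, 0, 1, 1]
    # What is wanted: ranks 0,2 on socket 0 and ranks 1,3 on socket 1.
    print([scatter_socket(r) for r in range(n_ranks)])         # [0, 1, 0, 1]
```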