Intel® MPI Library

Scatter the processes across sockets

ZQyouOSC
Beginner
Hello,

I am researching how to pin processes across sockets using the I_MPI_PIN_DOMAIN and I_MPI_PIN_ORDER environment variables. I have tried many combinations, but none of them have worked. I used Intel MPI 2021.12.1 and ran TACC amask on two systems: one equipped with two Xeon Platinum 8470 CPUs and the other equipped with two Xeon CPU Max 9470 CPUs.

 

With the default settings, I ran the following command:

mpiexec -n 4 amask_mpi

and I got the process placement in bunch order as expected:

     Each row of matrix is a mask for a Hardware Thread (hwt).                                                                                                                            
     CORE ID  = matrix digit + column group # in |...|                                                                                                                                       
     A set mask bit (proc-id) = core id + add 104 to each additional row.                                                                                                                    
                                                                                                                                                                                         
rank |    0    |   10    |   20    |   30    |   40    |   50    |   60    |   70    |   80    |   90    |   100   |                                                        
0000 0---4---8---2---6---0---4---8---2---6---0---4---8---2---6---0---4---8---2---6---0---4---8---2-----------                                                                         
0001 --2---6---0---4---8---2---6---0---4---8---2---6---0---4---8---2---6---0---4---8---2---6---0---4---------                                                                   
0002 -1---5---9---3---7---1---5---9---3---7---1---5---9---3---7---1---5---9---3---7---1---5---9---3----------                                                                   
0003 ---3---7---1---5---9---3---7---1---5---9---3---7---1---5---9---3---7---1---5---9---3---7---1---5--------       

The first two processes are bound to the first and second NUMA nodes in the first socket, and the remaining two processes are bound to the first and second NUMA nodes in the second socket.


I can also achieve process placement in compact order with the following variables:

I_MPI_PIN_DOMAIN=core
I_MPI_PIN_ORDER=compact

and I got the processes placed in the first NUMA node:

     Each row of matrix is a mask for a Hardware Thread (hwt).                                                                                                                                
     CORE ID  = matrix digit + column group # in |...|                                                                                                                                        
     A set mask bit (proc-id) = core id + add 104 to each additional row.                                                                                                                     
                                                                                                                                                                                             
rank |    0    |   10    |   20    |   30    |   40    |   50    |   60    |   70    |   80    |   90    |   100   |                                                                          
0000 0-------------------------------------------------------------------------------------------------------                                                                                 
0001 --------8-----------------------------------------------------------------------------------------------                                                                                 
0002 ----------------6---------------------------------------------------------------------------------------                                                                                 
0003 ------------------------4-------------------------------------------------------------------------------                                            

 

I have been experimenting with different values of I_MPI_PIN_DOMAIN and I_MPI_PIN_ORDER to place processes in scatter order across sockets: for example, I want the first and third processes bound to the first socket and the second and fourth processes bound to the second socket. However, I have not found a working combination. Could you please offer any suggestions? Thank you.
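For reference, one of the combinations I tried is the following (a sketch of exactly what I export before launching):

export I_MPI_PIN_DOMAIN=core
export I_MPI_PIN_ORDER=scatter
mpiexec -n 4 amask_mpi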

 

TobiasK
Moderator

@ZQyouOSC 
Can you please post the output after setting export I_MPI_DEBUG=10?

ZQyou
Beginner

Sure. The following is the output with I_MPI_DEBUG=10, I_MPI_PIN_DOMAIN=core and I_MPI_PIN_ORDER=scatter:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (342 MB per rank) * (4 local ranks) = 1368 MB total
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/apps/spack/0.21/cardinal/linux-rhel9-sapphirerapids/intel-oneapi-mpi/intel/2021.10.0/2021.10.0-a2ei2t4/mpi/2021.10.0/etc/tuning_spr_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/apps/spack/0.21/cardinal/linux-rhel9-sapphirerapids/intel-oneapi-mpi/intel/2021.10.0/2021.10.0-a2ei2t4/mpi/2021.10.0/etc/tuning_spr_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank    Pid      Node name          Pin cpu
[0] MPI startup(): 0       776163   c1002.ten.osc.edu  {0}
[0] MPI startup(): 1       776164   c1002.ten.osc.edu  {8}
[0] MPI startup(): 2       776165   c1002.ten.osc.edu  {16}
[0] MPI startup(): 3       776166   c1002.ten.osc.edu  {24}
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_FC=ifort
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/apps/spack/0.21/cardinal/linux-rhel9-sapphirerapids/intel-oneapi-mpi/intel/2021.10.0/2021.10.0-a2ei2t4/mpi/2021.10.0
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS=--external-launcher
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0
[0] MPI startup(): I_MPI_HYDRA_BRANCH_COUNT=-1
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=slurm
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_PIN_DOMAIN=core
[0] MPI startup(): I_MPI_PIN_ORDER=scatter
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10

     Each row of matrix is a mask for a Hardware Thread (hwt).
     CORE ID  = matrix digit + column group # in |...|
     A set mask bit (proc-id) = core id + add 104 to each additional row.

rank |    0    |   10    |   20    |   30    |   40    |   50    |   60    |   70    |   80    |   90    |   100   |
0000 0-------------------------------------------------------------------------------------------------------
0001 --------8-----------------------------------------------------------------------------------------------
0002 ----------------6---------------------------------------------------------------------------------------
0003 ------------------------4-------------------------------------------------------------------------------

 

TobiasK
Moderator

It seems you are using Slurm. If so, please use srun and manage the pinning through Slurm.

To use mpiexec/mpirun and ignore the Slurm settings, please follow the instructions here:

https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-13/job-schedulers-support.html
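For example, something along these lines should place the ranks round-robin across the two sockets when launched with srun (a sketch only; the exact --distribution and --cpu-bind values depend on your Slurm configuration):

srun -n 4 --distribution=block:cyclic --cpu-bind=cores ./amask_mpi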

 

 

ZQyou
Beginner

Yes, I am using Slurm, but I am sure that the Intel MPI Hydra process manager handled the launch. When I run under Slurm, I normally set the following:

I_MPI_HYDRA_BOOTSTRAP=slurm
I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

To use the Hydra process manager for the previous output I sent, I unset these settings first:

export -n I_MPI_HYDRA_BOOTSTRAP I_MPI_PMI_LIBRARY

I am confident Hydra was in control because Slurm's CPU binding control had no effect.

 

I also tried using SSH as the Hydra bootstrap, but I got the same result.

$ export -n I_MPI_HYDRA_BOOTSTRAP I_MPI_PMI_LIBRARY 
$ export I_MPI_HYDRA_BOOTSTRAP=ssh
$ mpiexec -n 4 bin/amask_mpi

[0] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)

[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (342 MB per rank) * (4 local ranks) = 1368 MB total
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/apps/spack/0.21/cardinal/linux-rhel9-sapphirerapids/intel-oneapi-mpi/intel/2021.10.0/2021.10.0-a2ei2t4/mpi/2021.10.0/etc/tuning_spr_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/apps/spack/0.21/cardinal/linux-rhel9-sapphirerapids/intel-oneapi-mpi/intel/2021.10.0/2021.10.0-a2ei2t4/mpi/2021.10.0/etc/tuning_spr_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank    Pid      Node name          Pin cpu
[0] MPI startup(): 0       783094   c1002.ten.osc.edu  {0}
[0] MPI startup(): 1       783095   c1002.ten.osc.edu  {8}
[0] MPI startup(): 2       783096   c1002.ten.osc.edu  {16}
[0] MPI startup(): 3       783097   c1002.ten.osc.edu  {24}
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_FC=ifort
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/apps/spack/0.21/cardinal/linux-rhel9-sapphirerapids/intel-oneapi-mpi/intel/2021.10.0/2021.10.0-a2ei2t4/mpi/2021.10.0
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS=--external-launcher
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0
[0] MPI startup(): I_MPI_HYDRA_BRANCH_COUNT=-1
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=ssh
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_PIN_DOMAIN=core
[0] MPI startup(): I_MPI_PIN_ORDER=scatter
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10

     Each row of matrix is a mask for a Hardware Thread (hwt).
     CORE ID  = matrix digit + column group # in |...|
     A set mask bit (proc-id) = core id + add 104 to each additional row.

rank |    0    |   10    |   20    |   30    |   40    |   50    |   60    |   70    |   80    |   90    |   100   |
0000 0-------------------------------------------------------------------------------------------------------
0001 --------8-----------------------------------------------------------------------------------------------
0002 ----------------6---------------------------------------------------------------------------------------
0003 ------------------------4------------------------------------------------------------------------------- 
TobiasK
Moderator

@ZQyouOSC 

Can you please try the pinning simulator:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library-pinning-simulator.html

 

Also, please set:
I_MPI_PIN_RESPECT_HCA=0
I_MPI_PIN_RESPECT_CPUSET=0
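For example (just a sketch combining the variables already discussed in this thread):

I_MPI_DEBUG=10 I_MPI_PIN_RESPECT_HCA=0 I_MPI_PIN_RESPECT_CPUSET=0 \
I_MPI_PIN_DOMAIN=core I_MPI_PIN_ORDER=scatter mpiexec -n 4 ./amask_mpi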

Best

ZQyouOSC
Beginner

Hello,

I have tried the simulator, and it suggests the following command line to achieve scatter pinning across sockets:

I_MPI_PIN_DOMAIN=core I_MPI_PIN_ORDER=scatter I_MPI_PIN_CELL=unit mpiexec -n 4

I used these environment variables together with the two you suggested previously (I_MPI_PIN_RESPECT_HCA=0 and I_MPI_PIN_RESPECT_CPUSET=0), but I still got the same compact-order placement I reported above.
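As a fallback I am considering pinning via an explicit processor list instead, along these lines (a sketch only; it assumes logical CPUs 0-51 sit on the first socket and 52-103 on the second, which may not match the actual numbering on these nodes):

export I_MPI_PIN_PROCESSOR_LIST=0,52,1,53
mpiexec -n 4 amask_mpi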

 

 
