I am using MPI and OpenMP on a single node with 4 CPUs, each with 18 cores. I am trying to analyze the performance of my application by launching varying combinations of MPI processes and OpenMP threads. What confuses me is that the KMP_AFFINITY output indicates that each MPI process gets pinned to exactly the same hardware threads.
I start my program like this:
export OMP_NUM_THREADS=8
export I_MPI_PIN_DOMAIN=omp
export I_MPI_PIN_ORDER=compact
export KMP_AFFINITY=verbose
mpirun --ppn 2 --np 2 ./my_exe
stderr of MPI process 0:
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0-3,72-75
OMP: Info #217: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #157: KMP_AFFINITY: 8 available OS procs
OMP: Info #158: KMP_AFFINITY: Uniform topology
OMP: Info #288: KMP_AFFINITY: topology layer "NUMA domain" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
OMP: Info #288: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #192: KMP_AFFINITY: 1 socket x 4 cores/socket x 2 threads/core (4 total cores)
OMP: Info #219: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 72 maps to socket 0 core 0 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 1 maps to socket 0 core 1 thread 2
OMP: Info #172: KMP_AFFINITY: OS proc 73 maps to socket 0 core 1 thread 3
OMP: Info #172: KMP_AFFINITY: OS proc 2 maps to socket 0 core 2 thread 4
OMP: Info #172: KMP_AFFINITY: OS proc 74 maps to socket 0 core 2 thread 5
OMP: Info #172: KMP_AFFINITY: OS proc 3 maps to socket 0 core 3 thread 6
OMP: Info #172: KMP_AFFINITY: OS proc 75 maps to socket 0 core 3 thread 7
stderr of MPI process 1:
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 4-7,76-79
OMP: Info #217: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #157: KMP_AFFINITY: 8 available OS procs
OMP: Info #158: KMP_AFFINITY: Uniform topology
OMP: Info #288: KMP_AFFINITY: topology layer "NUMA domain" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
OMP: Info #288: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #192: KMP_AFFINITY: 1 socket x 4 cores/socket x 2 threads/core (4 total cores)
OMP: Info #219: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 4 maps to socket 0 core 0 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 76 maps to socket 0 core 0 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 5 maps to socket 0 core 1 thread 2
OMP: Info #172: KMP_AFFINITY: OS proc 77 maps to socket 0 core 1 thread 3
OMP: Info #172: KMP_AFFINITY: OS proc 6 maps to socket 0 core 2 thread 4
OMP: Info #172: KMP_AFFINITY: OS proc 78 maps to socket 0 core 2 thread 5
OMP: Info #172: KMP_AFFINITY: OS proc 7 maps to socket 0 core 3 thread 6
OMP: Info #172: KMP_AFFINITY: OS proc 79 maps to socket 0 core 3 thread 7
As expected, each MPI process is using a unique set of OS procs. However, the mapping from OS proc to physical thread is conflicting, which suggests that the processes are competing for the same hardware resources. For example, MPI process 0 maps OS proc 0 to (socket, core, thread) = (0,0,0), but MPI process 1 maps OS proc 4 to (0,0,0) as well.
I have tried other combinations of MPI processes and OpenMP threads. I_MPI_PIN_DOMAIN=socket and I_MPI_PIN_ORDER=scatter also give the same conflicting mapping of OS procs to hardware resources.
Is there an error in how I start the program, or is it my interpretation of the KMP_AFFINITY information that is wrong?
@Øyvind_Jensen please provide the output with I_MPI_DEBUG=10. Are you using Slurm, PBS, or another job management system?
@TobiasK, thanks for following up.
Here is the output on stdout with I_MPI_DEBUG=10:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.14 Build 20250213 (id: 0d7f579)
[0] MPI startup(): Copyright (C) 2003-2025 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (699 MB per rank) * (2 local ranks) = 1398 MB total
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.14/opt/mpi/etc/tuning_clx-ap_shm.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: async_progress coll split: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 30 (TAG_UB value: 1073741823)
[0] MPI startup(): source bits available: 0 (Maximal number of rank: 0)
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 2795806 d1-cmp-phy-lin5.ad.ife.no {0,1,2,3,72,73,74,75}
[0] MPI startup(): 1 2795807 d1-cmp-phy-lin5.ad.ife.no {4,5,6,7,76,77,78,79}
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.14
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_PIN_DOMAIN=omp
[0] MPI startup(): I_MPI_PIN_ORDER=compact
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm
[0] MPI startup(): I_MPI_DEBUG=10
I forgot to add that I launch with just a bash script, no job manager.
As you can see in the MPI output:
[0] MPI startup(): 0 2795806 d1-cmp-phy-lin5.ad.ife.no {0,1,2,3,72,73,74,75}
[0] MPI startup(): 1 2795807 d1-cmp-phy-lin5.ad.ife.no {4,5,6,7,76,77,78,79}
This is also reflected in the KMP_AFFINITY output:
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0-3,72-75
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 4-7,76-79
The pinning itself is correct: the two ranks get non-overlapping sets of OS procs. However, the socket/core/thread labels reported by KMP_AFFINITY are not the global machine numbering. Each rank's OpenMP runtime only sees its restricted set of CPUs and enumerates the topology relative to that set, so those labels cannot be compared across ranks; only the OS proc IDs are global, and they do not overlap. You can follow CPU utilization in top or other tools to confirm that the pinning is working as expected.
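If you want to confirm this from inside the application instead of watching top, a minimal hybrid test program can print, for every OpenMP thread of every rank, the OS proc it is actually running on. The sketch below is only an illustration (the file name check_pinning.c and the compile line are assumptions, not part of this thread); sched_getcpu() returns global OS proc IDs, so two ranks would only report overlapping numbers if they really were pinned to the same hardware threads.

/* check_pinning.c - minimal sketch: show where each hybrid thread actually runs. */
#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu() */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* sched_getcpu() reports the global OS proc ID, unlike the rank-local
           socket/core/thread labels printed by KMP_AFFINITY=verbose. */
        printf("rank %d, OMP thread %d, OS proc %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}

Built with, for example, mpiicx -qopenmp check_pinning.c -o check_pinning and launched with the same environment variables and mpirun command as above, rank 0 should only ever report OS procs from {0-3,72-75} and rank 1 only from {4-7,76-79}.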
