Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI code returning different results when running with 6 ranks

DaniRegener
Beginner

Good morning,

 

I am a newbie in HPC and I have lost control of my Intel MPI configuration on my local machine.

 

Info on the machine

My processor is as follows (output of lscpu):

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i7-12700H
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 1
Stepping: 3
CPU(s) scaling MHz: 10%
CPU max MHz: 4700,0000
CPU min MHz: 400,0000
BogoMIPS: 5376,00
Flags:...
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 544 KiB (14 instances)
L1i: 704 KiB (14 instances)
L2: 11,5 MiB (8 instances)
L3: 24 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerabilities: ...

The cores and threads are organized as follows (output of lscpu --extended):

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 4600,0000 400,0000 487,5840
1 0 0 0 0:0:0:0 yes 4600,0000 400,0000 400,0000
2 0 0 1 4:4:1:0 yes 4600,0000 400,0000 500,0000
3 0 0 1 4:4:1:0 yes 4600,0000 400,0000 400,0000
4 0 0 2 8:8:2:0 yes 4700,0000 400,0000 500,0000
5 0 0 2 8:8:2:0 yes 4700,0000 400,0000 400,0000
6 0 0 3 12:12:3:0 yes 4700,0000 400,0000 495,7830
7 0 0 3 12:12:3:0 yes 4700,0000 400,0000 400,0000
8 0 0 4 16:16:4:0 yes 4600,0000 400,0000 500,0000
9 0 0 4 16:16:4:0 yes 4600,0000 400,0000 400,0000
10 0 0 5 20:20:5:0 yes 4600,0000 400,0000 480,9330
11 0 0 5 20:20:5:0 yes 4600,0000 400,0000 400,0000
12 0 0 6 24:24:6:0 yes 3500,0000 400,0000 400,5130
13 0 0 7 25:25:6:0 yes 3500,0000 400,0000 400,0020
14 0 0 8 26:26:6:0 yes 3500,0000 400,0000 400,0190
15 0 0 9 27:27:6:0 yes 3500,0000 400,0000 400,0000
16 0 0 10 28:28:7:0 yes 3500,0000 400,0000 400,0000
17 0 0 11 29:29:7:0 yes 3500,0000 400,0000 400,0000
18 0 0 12 30:30:7:0 yes 3500,0000 400,0000 400,0000
19 0 0 13 31:31:7:0 yes 3500,0000 400,0000 400,0340

 

What I wanted to do

I wanted to bind the MPI processes to logical CPUs 0, 2, 4, 6, 8, 10, so as to use the P-cores without hyper-threading, which I understand gives the best performance for HPC. I run an MPI Fortran code that is beyond suspicion here, since before I tried to set this binding it always returned the expected results.
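To double-check which logical CPUs are the primary threads of the P-cores, I use a quick sysfs check (this assumes the standard Linux topology files and is only a sanity check on top of the lscpu output above):

# Each line lists the hardware threads sharing one physical core, so
# P-cores show up as pairs (e.g. "0-1") and E-cores as single CPUs (e.g. "12").
cat /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list | sort -u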

 

What I did

So I asked an AI how to do that. Leaving aside a lot of different commands that did not work, it suggested setting

export I_MPI_PIN=1 
export I_MPI_PIN_PROCESSOR_LIST=0,2,4,6,8,10

This did not distribute the processes as desired (checked by running the code and watching htop). I also set


export I_MPI_PIN_DOMAIN=core

which did not work either. I then installed hwloc via sudo apt install and selected it in the Intel MPI configuration with

 

I_MPI_HYDRA_TOPOLIB=hwloc

 

Setting this returned a warning that the CLI interface could be unstable. It did not work either, so I unset the environment variable and uninstalled hwloc.
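For reference, this is roughly how I have been checking which pinning is actually applied; I_MPI_DEBUG=5 makes the startup banner print the "Pin cpu" table shown further below, and the taskset loop is just an extra sanity check from the OS side:

export I_MPI_PIN=1
export I_MPI_PIN_PROCESSOR_LIST=0,2,4,6,8,10
export I_MPI_DEBUG=5
mpirun -np 6 ./mainCode &

# While the job runs, print the affinity mask the kernel sees for each rank:
for pid in $(pgrep -f mainCode); do taskset -cp "$pid"; done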


Info on the environment now

All I_MPI_* environment variables are now unset (checked with env | grep I_MPI), except

 

I_MPI_ROOT=/[myhome]/intel/oneapi/mpi/2021.1

which I understand comes from sourcing the oneAPI environment in my .bashrc. The Intel MPI version is

 

Intel(R) MPI Library for Linux* OS, Version 2021.11 Build 20231005 (id: 74c4a23)
Copyright 2003-2023, Intel Corporation.

I am using CMake, so the FC environment variable is set to mpif90, and mpif90 is a wrapper around

mpif90 for the Intel(R) MPI Library 2021.11 for Linux*
Copyright Intel Corporation.
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.5.0-6ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-xyHAOX/gcc-9-9.5.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
gcc version 9.5.0 (Ubuntu 9.5.0-6ubuntu2)
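To see exactly what the wrapper invokes, I can also query it directly (assuming the standard -show option of the Intel MPI compiler wrappers):

# Print the full compile/link line without building anything:
mpif90 -show
# Confirm which MPI shared library an already-built binary resolves to:
ldd ./mainCode | grep -i libmpi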

 

Info on the runs


A test code such as this:

program hello_world_mpi
use mpi
implicit none

integer :: process_Rank, size_Of_Cluster, ierror

call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size_Of_Cluster, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, process_Rank, ierror)

print *, 'Hello World from process: ', process_Rank, ' of ', size_Of_Cluster
call MPI_FINALIZE(ierror)
end program hello_world_mpi

It does not produce unexpected results. However, running the main code (which uses only the PETSc library) with

mpirun -np 6 ./mainCode


This produces clearly wrong results (it is an iterative code, and the tracked results of each iteration come out dramatically wrong, but there are no NaNs, Infs, or segfaults, no new warnings in valgrind, and no floating point exceptions). Running with 4 or 8 processes works well. The debugging info for this 6-process run is

 

[0] MPI startup(): Intel(R) MPI Library, Version 2021.11 Build 20231005 (id: 74c4a23)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.1-impi
[0] MPI startup(): libfabric provider: tcp
[0] MPI startup(): File "" not found
[0] MPI startup(): Load tuning file: "[home]/intel/oneapi/mpi/2021.11/opt/mpi/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Load tuning file: "[home]/intel/oneapi/mpi/2021.11/opt/mpi/etc//tuning_generic_shm-ofi.dat"
[0] MPI startup(): ===== Nic pinning on dani-katana =====
[0] MPI startup(): Rank Pin nic
[0] MPI startup(): 0 enp4s0
[0] MPI startup(): 1 enp4s0
[0] MPI startup(): 2 enp4s0
[0] MPI startup(): 3 enp4s0
[0] MPI startup(): 4 enp4s0
[0] MPI startup(): 5 enp4s0
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 8003 [user] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
[0] MPI startup(): 1 8004 [user] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
[0] MPI startup(): 2 8005 [user] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
[0] MPI startup(): 3 8006 [user] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
[0] MPI startup(): 4 8007 [user] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
[0] MPI startup(): 5 8008 [user] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
[0] MPI startup(): I_MPI_ROOT=[home]/intel/oneapi/mpi/2021.11
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=-1
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default

but htop shows activity only on logical CPUs 0-11, i.e. those associated with the P-cores and their hyper-threads. My OS is (output of lsb_release -a):

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04 LTS
Release: 24.04
Codename: noble

 

This particular run was made with the balanced power mode enforced, but the same phenomenon also happened in performance mode.
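One check I still plan to try, since the banner above reports the tcp libfabric provider even though everything runs on a single node, is to force the shared-memory fabric and compare the 6-rank results again (I_MPI_FABRICS=shm is my assumption of the relevant knob):

export I_MPI_FABRICS=shm
export I_MPI_DEBUG=5
mpirun -np 6 ./mainCode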

 

Conclusion

 

My apologies for not being able to provide reproducible steps for the error; I do not know how to trigger it with a simpler code that I could share. I have tried to provide the meaningful information. Suggestions for fixes or debugging procedures would be highly appreciated.

 

Best,

3 Replies
TobiasK
Moderator

@DaniRegener 
Please do not use hwloc for pinning. The upcoming 2021.15 release will include better pinning options for hybrid CPUs.
Please try with the latest version, e.g. 2021.14.2, and check whether the numerical problem still exists.
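After installing, you can confirm which runtime is actually picked up, for example like this (the setvars.sh path below is the default system-wide prefix; adjust it to your home-directory install):

source /opt/intel/oneapi/setvars.sh
mpirun --version      # should report the 2021.14 build
echo "$I_MPI_ROOT"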

DaniRegener
Beginner

@TobiasK 

 

Thank you for your reply,


I reinstalled oneAPI; Intel MPI is now:

Intel(R) MPI Library for Linux* OS, Version 2021.14 Build 20250213 (id: 0d7f579)
Copyright 2003-2025, Intel Corporation.

 

The problem persists. I observed two new things that I had not seen before:

  • when using 6 ranks for a longer simulation, the run gets killed by signal 9 (see the check sketched after this list), and
  • when using other numbers of ranks (1, 2, 4, 8, 10), small numerical errors appear and accumulate over time until they become visible and relevant (though I am not sure this is not just an issue in the main program itself).
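For the signal 9 kills I will check whether the kernel OOM killer is responsible; this is a generic Linux check, nothing specific to Intel MPI:

# Look for out-of-memory kills around the time the 6-rank run died:
sudo dmesg --ctime | grep -i -E "out of memory|oom" | tail
journalctl -k --since "1 hour ago" | grep -i oom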
TobiasK
Moderator

@DaniRegener do you have any hint that this is related to Intel MPI? You can enable reproducible results by setting
export I_MPI_CBWR=2
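For example, a minimal check with your PETSc binary from above (the log file name is just for illustration):

export I_MPI_CBWR=2
export I_MPI_DEBUG=5
mpirun -np 6 ./mainCode 2>&1 | tee run6_cbwr.log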

Best
