Intel® oneAPI Math Kernel Library

LINPACK with multiple MPI ranks does not behave as expected

grantcurell
Novice

Bottom line question up front:

When you run something like the following against two nodes:

mpirun -perhost 2 -np 8 -genv NUMA_PER_MPI=1 ./runme_intel64_prv

what exactly do -perhost, -np, and NUMA_PER_MPI do? I'm a Dell guy, and neither I nor any of my colleagues have been able to figure out how this works.
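
For context, here is our current working understanding of those three knobs, pieced together from the Intel MPI docs and the comments in runme_intel64_dynamic. Treat it as an assumption, because reconciling it with what we actually observe below is exactly the problem:

# Our working assumptions, not an authoritative answer:
#   -np 8             total number of MPI ranks across all nodes
#   -perhost 2        ranks Hydra should place on each node (same thing as -ppn)
#   NUMA_PER_MPI=1    read by Intel's HPL wrapper scripts; per the comments in
#                     runme_intel64_dynamic it is the number of NUMA domains
#                     each rank is supposed to span
mpirun -perhost 2 -np 8 -genv NUMA_PER_MPI=1 ./runme_intel64_prv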

Equipment

For testing, I am using two identical servers with the attached server specs. TL;DR: 56 cores, two sockets per server, 512 GB RAM per server, no hyperthreading, and four NUMA domains per socket (eight per host).

What We Expect vs What Happens

We launch with this SLURM file:

#!/bin/bash
#SBATCH --job-name=linpack
#SBATCH --output=linpack_%j.out
#SBATCH --partition=8480    # Specify your partition
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks=8
#SBATCH --time=0:30:00                # Time limit in the format hours:minutes:seconds

# Load the required modules
module load intel/oneAPI/2023.0.0
module load compiler-rt/2023.0.0 mkl/2023.0.0 mpi/2021.8.0

# Navigate to the directory containing your HPL files
cd /home/grant/mp_linpack

# Run the HPL benchmark
mpirun -perhost 2 -np 8 -genv NUMA_PER_MPI=1 ./runme_intel64_prv | tee -a $OUT

For all testing we used the following settings for HPL.dat:

N        :  117120        0        0
NB       :     384
PMAP     : Row-major process mapping
P        :       2
Q        :       4
PFACT    :   Right     Left     Left
NBMIN    :       4
NDIV     :       2
RFACT    :   Right
BCAST    :  1ringM
DEPTH    :       1
SWAP     : Mix (threshold = 64)
L1       : transposed form
U        : transposed form
EQUIL    : yes
ALIGN    :   16 double precision words
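
As a sanity check on sizing (standard HPL arithmetic, not anything from the docs): the matrix takes N^2 * 8 bytes, which is nowhere near our memory limits, so the problem size itself shouldn't be the issue:

N=117120
echo "$N * $N * 8 / 2^30" | bc        # ~102 GiB for the whole matrix
echo "$N * $N * 8 / 2^30 / 2" | bc    # ~51 GiB per node, against 512 GB installed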

Firstly, we noticed that SLURM's --ntasks and --ntasks-per-node are completely irrelevant. Some internal logic makes sure mpirun runs once, and only once, on a single node. You can change those settings as much as you want and the results are identical.

With that said, to the heart of the problem. In the script above you can see I have set -perhost to 2, so we expect runme_intel64_prv to run twice on each host.

First contradiction: That is not what happens. runme_intel64_prv runs four times per host. In fact, in our testing, it ALWAYS runs four times per host no matter what you set. See the files process_list_with_per_host_set_to_2.txt and process_list_with_per_host_set_to_4.txt; four instances of runme_intel64_prv run regardless.
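
If you want to reproduce the counts without opening the attachments, something like this during a run shows the same thing (node01/node02 stand in for our two hosts):

# Count the wrapper scripts and the actual HPL binaries on each host mid-run.
# node01/node02 are placeholders for our two servers.
for h in node01 node02; do
    echo "== $h =="
    ssh "$h" "pgrep -fc runme_intel64_prv; pgrep -fc xhpl_intel64_dynamic"
done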

What This Means to Us: Based on this, we assumed that the number of instances of runme_intel64_prv is tied to -np being set to 8 and that -perhost is ignored wholesale. The binary simply assigns four instances of runme_intel64_prv to each host regardless. If that is the case, then we would expect to be able to set -perhost to whatever we want and see identical performance.

Second Contradiction: That is not what happens either. Changing -perhost has a huge impact on performance even though it has no impact whatsoever on the number of threads running or the number of instances of runme_intel64_prv. See the files performance_with_per_host_set_to_2 and performance_with_per_host_set_to_4. Bottom line: with -perhost set to 2, performance is ~38% faster than with it set to 4. The setting very clearly matters even though it has no effect on the process/thread count.

UPDATE:

After staring at this for some time, I believe that, at a minimum, there is a bug present in the executable. The documentation says this about the -perhost setting:

When running under a job scheduler, this environment variable is ignored by default. To control process placement with I_MPI_PERHOST, disable the I_MPI_JOB_RESPECT_PROCESS_PLACEMENT variable.

Based on what we're seeing above, this isn't accurate. The variable is clearly used in some way, as it has a huge impact on performance even though we are running under SLURM.
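
If that passage is taken at face value, the knob it is talking about would be flipped like this (illustration of the documented setting only; the values 0/off/disable should be equivalent per the Intel MPI reference):

# Supposedly required before -perhost is honored under a job scheduler:
mpirun -genv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off \
       -perhost 2 -np 8 -genv NUMA_PER_MPI=1 ./runme_intel64_prv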

Contradictions within Documentation

I don't want to make this too long. I worked with some Intel colleagues and it took a lot to get to this point; enumerating it all would take too long, so here are some highlights.

  • Using the Intel-recommended values causes the system to crash. For example, the recommended number of MPI processes for two nodes is 16 (8 per node) with both P and Q set to 4. If you do this, the system crashes with some variant of the message `HPL[ 15, z1-34] Failed memory mapping : NodeMask =`. If you go against the recommendations and set P to 2 and Q to 4, it works, but the performance is bad, in the same neighborhood as what you see in performance_with_per_host_set_to_4.txt.
  • The documentation says to set NUMA_PER_MPI based on the number of NUMA subdomains. If you do this, performance tanks. Ostensibly you would set it to four, as that is the number of NUMA subdomains, but doing so also gets you performance in the same neighborhood as performance_with_per_host_set_to_4.txt. (See the worked numbers right after this list.)
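
Putting the two sets of guidance side by side for this hardware (2 nodes, 2 sockets and 8 NUMA domains per node); the formulas are the ones stated in the comments of runme_intel64_dynamic, so this is just arithmetic, not anything official:

# Relationships from the comments in runme_intel64_dynamic:
#   MPI_PROC_NUM                = P x Q          (total ranks)
#   MPI_PER_NODE                = 1 or the number of sockets per node
#   MPI_PER_NODE * NUMA_PER_MPI = NUMA domains per node
#
# Following those formulas on our boxes (2 sockets, 8 NUMA domains per node):
#   MPI_PER_NODE = 2, NUMA_PER_MPI = 4, MPI_PROC_NUM = 2 nodes * 2 = 4, so P x Q = 4
#
# Following the recommendation we were given (16 ranks, 8 per node, P = Q = 4):
#   MPI_PER_NODE = 8, which is neither 1 nor the socket count, and it forces
#   NUMA_PER_MPI = 1, i.e. the recommendation contradicts the script's own comments.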

It's not NUMA Alignment (as far as we can tell)

While I haven't sat down and written code to exhaustively check this, I pulled `numastat -p <pid>` for several of the processes. Every one I sampled was correctly aligned to a single NUMA domain regardless of whether -perhost was 2 or 4.
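
The spot check was roughly this, run on each node while the benchmark was going:

# numastat -p shows how each HPL process's pages are spread across NUMA nodes;
# a well-pinned rank should sit almost entirely in one of them.
for pid in $(pgrep -f xhpl_intel64_dynamic); do
    echo "=== PID $pid ==="
    numastat -p "$pid"
done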

Conclusion

There is something going on in the `xhpl_intel64_dynamic` binary with respect to -perhost that does not appear to affect NUMA alignment, the number of threads spawned, or anything else in the process architecture, but that has a significant impact on performance.

We're trying to figure out what that is.

2 Replies
grantcurell
Novice

Update

I simplified things by removing SLURM from the equation entirely. The abnormal behavior persists, though it is significantly different. Setting `-ppn` to 2 still does not result in only two processes running. I am also now using runme_intel64_dynamic.sh. At this point, what I really want to know is what `-ppn` (-perhost) and `-np` (i.e., MPI_PER_NODE and MPI_PROC_NUM) should be set to for a given configuration. It doesn't follow any of the HPL standards, since it does custom threading under the hood, and the docs do not say.

Worth noting, since there was a comment on this: runme_intel64_dynamic.sh does nothing special. It sets MPI_PROC_NUM, MPI_PER_NODE, and NUMA_PER_MPI, and IF (and only if) you are doing GPU work it sets a few extra environment variables. Since I'm not doing GPU work, it literally just takes those three environment variables and passes them directly into mpirun unchanged with `mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT`. NUMA_PER_MPI is passed via the environment.
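
In other words, with the values set in the script below, the whole thing boils down to roughly this one command (my own expansion, with the logging and the HPL-AI/GPU branches stripped out):

# Equivalent direct launch with MPI_PER_NODE=2, MPI_PROC_NUM=4, NUMA_PER_MPI=4
# from the script below; HPL_EXE and NUMA_PER_MPI travel via the environment.
HPL_EXE=xhpl_intel64_dynamic NUMA_PER_MPI=4 HPL_NUMTHREADS=56 \
    mpirun -perhost 2 -np 4 ./runme_intel64_prv | tee -a xhpl_intel64_dynamic_outputs.txt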

What I have above is 100% functionally identical. See below:

#!/bin/bash
#===============================================================================
# Copyright 2001-2023 Intel Corporation.
#
# This software and the related documents are Intel copyrighted  materials,  and
# your use of  them is  governed by the  express license  under which  they were
# provided to you (License).  Unless the License provides otherwise, you may not
# use, modify, copy, publish, distribute,  disclose or transmit this software or
# the related documents without Intel's prior written permission.
#
# This software and the related documents  are provided as  is,  with no express
# or implied  warranties,  other  than those  that are  expressly stated  in the
# License.
#===============================================================================

# Set total number of MPI processes for the HPL (should be equal to PxQ).
export MPI_PROC_NUM=4

# Set the MPI per node for each node.
# MPI_PER_NODE should be equal to 1 or number of sockets on the system.
# It will be same as -perhost or -ppn parameters in mpirun/mpiexec.
export MPI_PER_NODE=2

# Set the number of NUMA nodes per MPI. (MPI_PER_NODE * NUMA_PER_MPI)
# should be equal to number of NUMA nodes on the system.
export NUMA_PER_MPI=4

#====================================================================
# Following option is for Intel(R) Optimized HPL-AI Benchmark
#====================================================================

# Comment in to enable Intel(R) Optimized HPL-AI Benchmark
# export USE_HPL_AI=1

#====================================================================
# Following option is for Intel(R) Optimized HPL-AI Benchmark for GPU
#====================================================================

# By default, Intel(R) Optimized HPL-AI Benchmark for GPU will use
# Bfloat16 matrix. If you prefer fewer iterations, you can choose a
# float-based matrix, but it will reduce the maximum problem size.
# export USE_BF16MAT=0

#====================================================================
# Following options are for Intel(R) Distribution for LINPACK
# Benchmark for GPU and Intel(R) Optimized HPL-AI Benchmark for GPU
#====================================================================

# Comment in to enable GPUs
# export USE_HPL_GPU=1

# Select backend driver for GPU (OpenCL ... 0, Level Zero ... 1)
# export HPL_DRIVER=0

# Number of stacks on each GPU
# export HPL_NUMSTACK=2

# Total number of GPUs on each node
# export HPL_NUMDEV=2

#====================================================================

export OUT=xhpl_intel64_dynamic_outputs.txt

if [ -z ${USE_HPL_AI} ]; then
    if [ -z ${USE_HPL_GPU} ]; then
        export HPL_EXE=xhpl_intel64_dynamic
    else
        export HPL_EXE=xhpl_intel64_dynamic_gpu
    fi
else
    if [ -z ${USE_HPL_GPU} ]; then
        export HPL_EXE=xhpl-ai_intel64_dynamic
    else
        export HPL_EXE=xhpl-ai_intel64_dynamic_gpu
    fi
fi

echo -n "This run was done on: "
date

# Capture some meaningful data for future reference:
echo -n "This run was done on: " >> $OUT
date >> $OUT
echo "HPL.dat: " >> $OUT
cat HPL.dat >> $OUT
echo "Binary name: " >> $OUT
ls -l ${HPL_EXE} >> $OUT
echo "This script: " >> $OUT
cat runme_intel64_dynamic >> $OUT
echo "Environment variables: " >> $OUT
env >> $OUT
echo "Actual run: " >> $OUT

# Environment variables can also be set on the Intel(R) MPI Library command
# line using the -genv option (to appear before the -np 1):

#export OMP_NUM_THREADS=10
export HPL_NUMTHREADS=56
mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT

echo -n "Done: " >> $OUT
date >> $OUT

echo -n "Done: "
date
grantcurell
Novice

I'm working on a full write up that I will post here, but I did figure out what was causing this and honestly, it's a bit bananas.

After reverse engineering everything, I realized that mpirun automatically detects SLURM's environment variables and SILENTLY overrides everything you do on the Intel command line. -ppn should control the number of MPI processes allocated to each machine, and it will do that, but not if SLURM's environment variables are present. Even more nutty is that runme_intel64_prv is set up to print the rank and node of all the MPI processes based on your inputs to runme_intel64_dynamic. So you look at the output and think, "Yes, that's what I want," but on the back end it absolutely isn't doing that, which, if you're new to this like I am, is fabulously confusing and took me I don't know how many hours to figure out.

However, since SLURM was present in the background environment for me, even when I wasn't directly running via a SLURM job, Intel's framework silently overrode everything I did. Once I figured this out, -ppn did control the number of runme_intel64_prv processes running on each node, each of which in turn spawns xhpl_intel64_dynamic.
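
If you want to check whether you're in the same trap, this was my own diagnostic (not an Intel-documented procedure): look for leftover SLURM variables in the shell you launch from, and either strip them or explicitly tell Intel MPI to ignore the scheduler placement.

# Leftover SLURM_* variables are what trigger the silent override of -ppn/-perhost.
env | grep '^SLURM_'

# Option 1: strip them before launching by hand
for v in $(env | awk -F= '/^SLURM_/ {print $1}'); do unset "$v"; done

# Option 2: tell Intel MPI not to respect the scheduler's placement
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off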

If you haven't been exposed to LINPACK before, this is all going to be extremely confusing, because some of the variables in runme_intel64_dynamic are consumed by mpirun, some by Hydra, some by runme_intel64_prv, and still others by xhpl_intel64_dynamic. The documentation is written in such a way that, unless you have reverse engineered the code base, you would never know to look at the Hydra docs for values like -ppn, because those values are arguments set in runme_intel64_dynamic, which then go to mpirun, which then go to Hydra.

Lastly, newbies also have to figure out where to place the host file. I was told to use runme_intel64_dynamic, but what isn't mentioned in any documentation anywhere is that you have to add the host file to the mpirun line inside runme_intel64_dynamic, which then passes it to Hydra. The only way you would know to do this is by reverse engineering the code base.
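
Concretely, that means the mpirun line inside runme_intel64_dynamic ends up looking something like this (hosts.txt is my own file with one hostname per line; -f is Hydra's host-file flag):

# hosts.txt: one hostname per line for the nodes that should participate.
mpirun -f hosts.txt \
       -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} \
       ./runme_intel64_prv "$@" | tee -a $OUT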

I strongly recommend that if the code is going to actively override what the user explicitly tells it to do, it should say so in verbose output rather than doing it as a silent operation that can only be discovered by reverse engineering the code base. I would also, more gently, recommend that the documentation include a quick start for beginners to help get them on their feet.
