Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Running HPL with multiple threads per task

Rachko__Anton
Beginner

I want to run HPL on a 2-socket (2P), 36-core node:

export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=2
mpirun -np 18 --map-by l3cache ./xhpl

but HPL runs as 18 single-threaded tasks; each task uses only one core instead of two:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 42052 user   20   0 3084748  69668  29116 R 100.0  0.0   0:42.47 xhpl
 42058 user   20   0 3084744  69716  29168 R 100.0  0.0   0:42.47 xhpl
 42062 user   20   0 3084748  73748  29100 R 100.0  0.0   0:42.47 xhpl
 42063 user   20   0 3086792  71724  29128 R 100.0  0.0   0:42.44 xhpl
 42067 user   20   0 3084748  73664  29024 R 100.0  0.0   0:42.46 xhpl
 42050 user   20   0   53.3g  22.9g  29148 R  99.7 12.2   0:42.46 xhpl
 42051 user   20   0 3086796  71744  29144 R  99.7  0.0   0:42.45 xhpl
 42053 user   20   0 3084748  69692  29140 R  99.7  0.0   0:42.46 xhpl
 42054 user   20   0 3084744  69720  29172 R  99.7  0.0   0:42.45 xhpl
 42055 user   20   0 3084744  73768  29124 R  99.7  0.0   0:42.46 xhpl
 42056 user   20   0 3084744  73792  29148 R  99.7  0.0   0:42.46 xhpl
 42057 user   20   0 3084744  69596  29048 R  99.7  0.0   0:42.45 xhpl
 42059 user   20   0 3084744  73724  29080 R  99.7  0.0   0:42.46 xhpl
 42060 user   20   0 3088844  75856  29160 R  99.7  0.0   0:42.45 xhpl
 42061 user   20   0 3084744  69672  29124 R  99.7  0.0   0:42.46 xhpl
 42064 user   20   0 3084744  69700  29152 R  99.7  0.0   0:42.46 xhpl
 42065 user   20   0 3086796  71720  29120 R  99.7  0.0   0:42.47 xhpl
 42066 user   20   0 3084744  73708  29064 R  99.7  0.0   0:42.45 xhpl
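
For reference, one way to check where each thread actually lands (a diagnostic sketch; -eL lists every thread and psr shows the logical processor each thread last ran on):

ps -eLo pid,tid,psr,pcpu,comm | grep xhpl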

 

What am I doing wrong?

 

Some relevant parts of my Make.intel64 file:

MPdir        = $(I_MPI_ROOT)
MPinc        = -I$(MPdir)/intel64/include
MPlib        = $(MPdir)/intel64/lib/release_mt/libmpi.so

LAdir        = $(MKLROOT)/lib/intel64
LAinc        = -I$(MKLROOT)/include
LAlib        = -mkl=cluster

CC           = mpiicc
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = -qopenmp -xSKYLAKE-AVX512 -fomit-frame-pointer -O3 -funroll-loops $(HPL_DEFS)

LINKER       = mpiicc
LINKFLAGS    = $(CCFLAGS)

 

2 Replies
TimP
Honored Contributor III

Depending on top's default settings (which vary by distro), 100% may indicate that all threads in each process are running fully. In any case, unless you set MKL-specific environment variables, MKL will choose the optimum number of threads for itself, regardless of the calling application's settings. If all your time is spent in MKL, the top reports show how those functions are running, not the calling code. This question is more likely to see useful discussion in the Clusters and HPC Technology (https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology) or MKL (https://software.intel.com/en-us/forums/intel-math-kernel-library) forum sections.
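
If you do want the OpenMP setting to govern MKL as well, the MKL-specific controls can be set explicitly; a sketch using standard MKL environment variables (verify against your MKL version's documentation):

export MKL_NUM_THREADS=2    # cap MKL's internal thread team at 2 threads per task
export MKL_DYNAMIC=FALSE    # keep MKL from adjusting the thread count at run time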

McCalpinJohn
Honored Contributor III

The "mpirun -np 18" command tells the system to launch 18 MPI tasks.  The OMP_NUM_THREADS=2 tells each task to use 2 threads.  This will give a total of 36 application threads, which is the correct number for maximum performance on a 2s node.  I would not count on the output of "top" or "ps" to tell you the hardware utilization -- "perf stat" is a better tool for that.

For a 2s system, I typically (but not always!) find one MPI task per socket to give the best performance.  Something like:

export OMP_NUM_THREADS=18
export KMP_AFFINITY=scatter
export I_MPI_DEBUG=5
perf stat -o "perf_output.txt" -a -A mpirun -np 2 ./xhpl
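
With Intel MPI, it may also help to pin each rank to its own socket explicitly; a hedged addition (I_MPI_PIN_DOMAIN is a standard Intel MPI control, but check the documentation for your MPI version):

export I_MPI_PIN_DOMAIN=socket    # give each of the two ranks a full socket as its pinning domain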

The output from "perf stat" will be extensive, including separate counts for each logical processor, but looking at the "CPU Cycles" and "Instructions Retired" for all of the logical processors will make it clear exactly how many actually got used for the benchmark run.
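
For instance, a quick way to tally how many logical CPUs did real work (a sketch: it assumes the per-CPU "cycles" lines in perf_output.txt look like "CPU0  12,345,678,901  cycles", which can vary with the perf version, and the 1e9-cycle threshold is an arbitrary cutoff):

awk '$1 ~ /^CPU[0-9]+$/ && $3 == "cycles" {gsub(",", "", $2); if ($2 + 0 > 1e9) busy++}
     END {print busy, "logical CPUs accumulated over 1e9 cycles"}' perf_output.txt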
