Intel® oneAPI Math Kernel Library

LINPACK with multiple MPI ranks

Holger_A_
Beginner
1,887 Views
Hello,

to benchmark our new Skylake cluster, which consists of two- and four-socket machines, together with a Broadwell system, I want to be able to run LINPACK with a different number of MPI ranks per node. My problem is that too many processes are spawned on the four-socket nodes, where I launch two MPI ranks.

I tried to limit the number of threads via OMP_NUM_THREADS and MKL_NUM_THREADS, but without effect. TBB seems to be the cause here, because some MKL functions (which are probably used in LINPACK) are threaded with it: https://software.intel.com/en-us/mkl-macos-developer-guide-functions-threaded-with-intel-threading-building-blocks

As far as I know, the number of threads created with TBB cannot be influenced through environment variables. So my question is: how do I run LINPACK with two MPI ranks on one node (and get the full performance)?

Best regards,
Holger
5 Replies
Ying_H_Intel
Employee
Hi Holger,
Could you please tell us which LINPACK binary you are running, and how you run it?
I noticed that the documentation you refer to is for macOS, and you mention that the OpenMP settings do not work while TBB seems to be running. Are you working on macOS?

MKL releases three benchmarks, and I am not sure which one you are running.

If you are working on Linux and using No. 2), the Intel Distribution for LINPACK Benchmark, you can use HPL_HOST_CORE (https://software.intel.com/en-us/mkl-linux-developer-guide-environment-variables) to control the number of threads and the core usage.

If you are working on macOS and using No. 1), note what the documentation says: https://software.intel.com/en-us/mkl-macos-developer-guide-known-limitations-of-the-intel-optimized-linpack-benchmark

Known Limitations of the Intel® Optimized LINPACK Benchmark

The following limitations are known for the Intel Optimized LINPACK Benchmark for macOS*:

  • Intel Optimized LINPACK Benchmark supports only OpenMP threading
  • If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file.
  • The binary will hang if it is not given an input file or any other arguments.

So setting the number of OpenMP threads should work in that case. You can also use export KMP_AFFINITY=verbose to see how many OpenMP threads are spawned.

Best regards,
Ying

Holger_A_
Beginner
Hi Ying,

thank you for your reply. In fact, using HPL_HOST_CORE or, respectively, HPL_HOST_NODE is helping me. I am using LINPACK from MKL under Linux, so most probably option 2).

What I do at the moment is the following (see output_linpack1.txt in the attachment):

    mpirun -machinefile $M_NAME -host r05n01 -env HPL_HOST_NODE=0,1 -np 1 /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic : -host r05n01 -env HPL_HOST_NODE=2,3 -np 1 /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic

This delivers about 5.8 TFLOP/s at the beginning of the LINPACK run. When I run (output_linpack2.txt)

    mpirun -machinefile $M_NAME -np 2 /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic

on the same machine, the output regarding the thread placement is the same, I think, but I only get 2 TFLOP/s.

As a test, I ran (output_linpack3.txt)

    export I_MPI_PIN_DOMAIN=omp
    export OMP_NUM_THREADS=1
    export MKL_NUM_THREADS=1
    mpirun -machinefile $M_NAME -np $NUM_PROCS /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic

Now it claims to start one thread per MPI process, but in fact (according to top) it also starts 36 threads per process. I don't understand why multiple processes are spawned here. Therefore I think that in the second example there are also more threads per process, and both MPI ranks are trying to use the whole machine.

Unfortunately, to use this four-socket machine together with our two-socket nodes, I would like to be able to start two MPI processes.

Best regards,
Holger
Ying_H_Intel
Employee
Hi Holger,

What exactly is your test machine? Is it a 4-socket Broadwell system, i.e. 18 cores * 2 HT * 4 = 144 threads? What does your top output look like?
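
For readers who want the per-process thread count without reading it off top, here is a minimal sketch. It assumes a Linux system with /proc and pgrep available; the function names are hypothetical and not part of MKL or Intel MPI.

```shell
# Extract the "Threads:" value from the contents of a /proc/<pid>/status
# file supplied on stdin.
threads_from_status() {
    awk '/^Threads:/ { print $2 }'
}

# Print "PID THREADS" for each process whose command line matches the
# pattern in $1. pgrep -f matches against the full command line, so the
# mpirun launcher itself may show up in the list as well.
list_thread_counts() {
    for pid in $(pgrep -f "$1"); do
        # Skip processes that exited between pgrep and this read.
        [ -r "/proc/$pid/status" ] || continue
        echo "$pid $(threads_from_status < "/proc/$pid/status")"
    done
}
```

While the benchmark is running, list_thread_counts xhpl_intel64_dynamic would then list one line per matching process.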

There is some related discussion in
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/605789

Some hints:

1. MP_LINPACK does not use OpenMP threads, so the OpenMP thread-count settings cannot be used to control the MKL threads.

2. Judging from the output, case 2 and case 1 should be the same, but case 1 has a clearer affinity. Judging from the CPU usage, case 1 uses less CPU but gives good performance.

3. Case 3, which pins each MPI rank to 1 node, has almost the same performance as case 2. The performance difference may be caused by different memory usage, etc.

4. According to our experience:

MPI_PROC_NUM: the number of actual physical servers, which equals P x Q.

MPI_PER_NODE: this should be 1, regardless of whether you have a single-socket or a dual-socket system. If you set it to 2 on a dual-socket system, htop will show a memory usage of 40% when in fact 80% is in use, and there will be two controlling threads instead of one.

So for your single 4-socket node, it should be enough to simply run:

    mpirun -np 1 ./xhpl_intel64_dynamic -p 1 -q 1

(I get almost the same result as with case 1.)

T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R2       10000   384     1     1               2.70            2.46543e+02

For your two nodes, the 2-socket Skylake and the 4-socket Broadwell, the affinity may be the same as in your case 1.

Here are my results, which almost reproduce your problem. I tried on my 2-socket Skylake system:

root@dell-r640:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/benchmarks/mp_linpack# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 15730 MB
node 0 free: 14356 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 16081 MB
node 1 free: 14894 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
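
The NUMA node numbers in this listing are the values that go into HPL_HOST_NODE. As a small sketch (the helper names are hypothetical), the node count and the CPU list of one node can be pulled out of `numactl --hardware` output like this:

```shell
# Read `numactl --hardware` output on stdin and print the number of NUMA nodes.
numa_node_count() {
    awk '/^available:/ { print $2; exit }'
}

# Read `numactl --hardware` output on stdin and print the CPUs of NUMA
# node $1 as a comma-separated list.
numa_node_cpus() {
    awk -v n="$1" '$1 == "node" && $2 == n && $3 == "cpus:" {
        for (i = 4; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? "," : "\n")
    }'
}
```

For example, `numactl --hardware | numa_node_count` prints 2 for the machine above.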

Your Case 2:

s/mp_linpack# mpirun -np 2 ./xhpl_intel64_dynamic
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       217575   dell-r640  {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30}
[0] MPI startup(): 1       217576   dell-r640  {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31}
...
dell-r640       : Column=009984 Fraction=0.995 Kernel= 4954.61 Mflops=205557.14
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R2       10000   384     2     1               3.87            1.72230e+02

Your Case 1:

root@dell-r640:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/benchmarks/mp_linpack# mpirun -env HPL_HOST_NODE=0 -np 1 ./xhpl_intel64_dynamic : -env HPL_HOST_NODE=1 -np 1 ./xhpl_intel64_dynamic
[0] MPI startup(): Multi-threaded optimized library
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       217894   dell-r640  {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30}
[0] MPI startup(): 1       217895   dell-r640  {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 1
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R2       10000   384     2     1               2.65            2.51188e+02
HPL_pdgesv() start time Wed May  9 17:20:37 2018

Your Case 3:
s/mp_linpack# mpirun -np 2 ./xhpl_intel64_dynamic
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       220999   dell-r640  {0}
[0] MPI startup(): 1       221000   dell-r640  {16}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 16
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R2       10000   384     2     1               3.76            1.77269e+02
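
To compare the three runs, the Gflops column can be taken from the result lines and turned into a ratio. This is a sketch with hypothetical helper names; it assumes that HPL result lines start with WR or WC followed by a digit, as in the outputs above.

```shell
# Print the Gflops column (last field) of every HPL result line on stdin.
hpl_gflops() {
    awk '/^W[RC][0-9]/ { print $NF }'
}

# Print the ratio $1 / $2 with two decimals.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}
```

With the numbers above, `speedup 2.51188e+02 1.72230e+02` prints 1.46, so case 1 is roughly 46% faster than case 2 on this machine.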

Best Regards,
Ying

Ying_H_Intel
Employee
To summarize, so that other developers can refer to this thread.

Two basic points:

MPI_PROC_NUM: the number of actual physical servers, which equals P x Q.

MPI_PER_NODE: this should be 1, regardless of whether you have a single-socket or a dual-socket system. If you set it to 2 on a dual-socket system, htop will show a memory usage of 40% when in fact 80% is in use, and there will be two controlling threads instead of one.

By default, HPL uses all resources of the machine, so two HPL processes end up sharing most of those resources, which explains the bad performance in case 2.

Thus we recommend one of the following:

A. Save

    #!/bin/bash
    export HPL_HOST_NODE=$(($PMI_RANK * 2 + 0)),$(($PMI_RANK * 2 + 1))
    ./xhpl_intel64_dynamic "$@"

as a script named runme, make it executable, and then run

    mpirun -n 2 ./runme -p 2 -q 1 -b 384 -n 40000

B. Or run

    mpirun -env HPL_HOST_NODE=0,1 -np 1 ./xhpl_intel64_dynamic : -env HPL_HOST_NODE=2,3 -np 1 ./xhpl_intel64_dynamic

(where P=2, Q=1 in HPL.dat).

C. Or run

    mpirun -np 1 ./xhpl_intel64_dynamic -p 1 -q 1 -b 384 -n 40000

Cases A and B should be similar and match Holger's test. Case C should give almost the same result as A and B, and it also works if you have more nodes in the system with 1 MPI rank per node.
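
The arithmetic in script A hard-codes two NUMA nodes per rank. A minimal sketch of the same mapping for an arbitrary number of nodes per rank follows; hpl_host_nodes is a hypothetical helper, and PMI_RANK is assumed to be set by the MPI launcher, as in script A.

```shell
# Print the NUMA nodes for one MPI rank as a comma-separated list.
#   $1 = rank on the host (e.g. $PMI_RANK)
#   $2 = number of NUMA nodes each rank should own
# The body runs in a subshell, so the variables stay local.
hpl_host_nodes() (
    rank=$1
    per_rank=$2
    first=$(( rank * per_rank ))
    list=$first
    i=1
    while [ "$i" -lt "$per_rank" ]; do
        list="$list,$(( first + i ))"
        i=$(( i + 1 ))
    done
    echo "$list"
)

# Inside a wrapper such as runme one could then write (not executed here):
#   export HPL_HOST_NODE=$(hpl_host_nodes "$PMI_RANK" 2)
#   exec ./xhpl_intel64_dynamic "$@"
```

For rank 1 with two nodes per rank this yields 2,3, which matches the second half of case B.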

Best Regards,
Ying

Anup_N_Intel
Employee
Hi, what is the formula to calculate N? I searched the documentation but did not find a formula for calculating N, i.e. the problem size, based on the available memory of the host.
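
The thread itself does not give a formula. A commonly used rule of thumb, sketched here as an assumption rather than an official Intel formula: the HPL matrix holds N x N double-precision values of 8 bytes each, so N is chosen such that 8 * N^2 fills a fraction of the total memory (often around 80%, to leave room for the OS), and is then rounded down to a multiple of the block size NB. The helper name and the 0.8 fraction are illustrative.

```shell
# Estimate the HPL problem size N.
#   $1 = total memory in GiB (summed over all nodes in the run)
#   $2 = block size NB
#   $3 = fraction of memory to fill (assumption: about 0.8)
hpl_problem_size() {
    awk -v mem_gib="$1" -v nb="$2" -v frac="$3" 'BEGIN {
        bytes = mem_gib * 1073741824 * frac   # usable bytes
        n = sqrt(bytes / 8)                   # 8 bytes per double, N x N matrix
        printf "%d\n", int(n / nb) * nb       # round down to a multiple of NB
    }'
}
```

For example, `hpl_problem_size 32 384 0.8` prints 58368 for a host with 32 GiB and NB=384.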
