Hi Holger,
Could you please tell us which binary you are running for LINPACK, and how you run it?
I noticed that the documentation you refer to is for macOS, and you mentioned that OpenMP doesn't work but TBB seems to run. Are you working on macOS?
MKL ships three benchmarks:
- 1) Intel® Optimized LINPACK Benchmark for Linux*
- 2) Intel® Distribution for LINPACK* Benchmark
- 3) Intel® Optimized High Performance Conjugate Gradient Benchmark
I am not sure which one you are running.
If you are working on Linux with No. 2), the Intel Distribution for LINPACK Benchmark, you may use the HPL_HOST_CORE environment variable (https://software.intel.com/en-us/mkl-linux-developer-guide-environment-variables) to control the number of threads and which cores are used, as in the sketch just below.
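For example, a minimal sketch (the core list is an assumption for an 18-core socket; adjust it to your machine, and list cores individually, e.g. 0,1,2, if the range syntax is not accepted):
export HPL_HOST_CORE=0-17
mpirun -np 1 ./xhpl_intel64_dynamic
This should restrict the single rank's threads to cores 0-17, i.e. the first socket.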
If you are working on macOS, use No. 1).
As the documentation (https://software.intel.com/en-us/mkl-macos-developer-guide-known-limitations-of-the-intel-optimized-linpack-benchmark) says:
Known Limitations of the Intel® Optimized LINPACK Benchmark
The following limitations are known for the Intel Optimized LINPACK Benchmark for macOS*:
- Intel Optimized LINPACK Benchmark supports only OpenMP threading
- If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file.
- The binary will hang if it is not given an input file or any other arguments.
So setting OpenMP threading should work. You may also export KMP_AFFINITY=verbose to see how many OpenMP threads are spawned; a sketch follows below.
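A minimal sketch (the thread count is an assumption, and the binary/input names are the Linux ones; use the files shipped in your benchmarks directory):
export OMP_NUM_THREADS=16
export KMP_AFFINITY=verbose
./xlinpack_xeon64 lininput_xeon64
The verbose affinity output prints one binding line per OpenMP thread, so you can count how many threads were actually spawned and where they were bound.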
Best Regards,
Ying
Hi Holger,
What is your exact test machine? Is it a 4-socket Broadwell system, i.e. 18 cores x 2 HT x 4 sockets = 144 threads? What does your top output look like?
There is some related discussion in https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/605789
Some hints:
1. MP_LINPACK does not use OpenMP threads, so you cannot use the OpenMP thread count to control the MKL threads.
2. From the output, case 2 and case 1 should be the same, but case 1 has clearer affinity. Judging by CPU usage, case 1 uses less CPU yet gets good performance.
3. Case 3, which pins the MPI ranks to one node, has almost the same performance as case 2. The difference may be caused by different memory usage, etc.
4. According to our experience:
MPI_PROC_NUM = the number of actual physical servers, which equals P x Q.
MPI_PER_NODE = this should be 1, and it doesn't matter whether you have a single-socket or dual-socket system. If you set 2 on a dual-socket system, the memory usage shown in htop will be 40% but is in fact 80%, and there will be two controlling threads instead of one.
So for your 1-node, 4-socket system, it should simply work to run: mpirun -np 1 ./xhpl_intel64_dynamic -p 1 -q 1
(I get almost the same result as with case 1.)
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC00C2R2 10000 384 1 1 2.70 2.46543e+02
And for your 2-node setup (2-socket Skylake plus 4-socket Broadwell), it may behave the same as your affinity in case 1.
Here is a result that almost reproduces your problem, tried on my 2-socket Skylake system:
root@dell-r640:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/benchmarks/mp_linpack# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 15730 MB
node 0 free: 14356 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 16081 MB
node 1 free: 14894 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Your Case 2:
s/mp_linpack# mpirun -np 2 ./xhpl_intel64_dynamic
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 217575 dell-r640 {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30}
[0] MPI startup(): 1 217576 dell-r640 {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31}
...
dell-r640 : Column=009984 Fraction=0.995 Kernel= 4954.61 Mflops=205557.14
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC00C2R2 10000 384 2 1 3.87 1.72230e+02
Your Case 1:
root@dell-r640:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/benchmarks/mp_linpack# mpirun -env HPL_HOST_NODE=0 -np 1 ./xhpl_intel64_dynamic : -env HPL_HOST_NODE=1 -np 1 ./xhpl_intel64_dynamic
[0] MPI startup(): Multi-threaded optimized library
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 217894 dell-r640 {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30}
[0] MPI startup(): 1 217895 dell-r640 {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 1
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC00C2R2 10000 384 2 1 2.65 2.51188e+02
HPL_pdgesv() start time Wed May 9 17:20:37 2018
Your Case 3:
s/mp_linpack# mpirun -np 2 ./xhpl_intel64_dynamic
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 220999 dell-r640 {0}
[0] MPI startup(): 1 221000 dell-r640 {16}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 16
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC00C2R2 10000 384 2 1 3.76 1.77269e+02
Best Regards,
Ying
To summarize, so that more developers may refer to this, two basic points:
MPI_PROC_NUM = the number of actual physical servers, which equals P x Q.
MPI_PER_NODE = this should be 1, and it doesn't matter whether you have a single-socket or dual-socket system. If you set 2 on a dual-socket system, the memory usage shown in htop will be 40% but is in fact 80%, and there will be two controlling threads instead of one.
By default, HPL uses the whole machine's resources, so creating two HPL ranks on one node makes them share most of those resources; that is why case 2 performs badly.
Thus we recommend one of the following:
A. Save the following as a runme script:
#!/bin/bash
export HPL_HOST_NODE=$(($PMI_RANK * 2 + 0)),$(($PMI_RANK * 2 + 1))
./xhpl_intel64_dynamic $*
(each rank claims two NUMA nodes: rank 0 gets nodes 0 and 1, rank 1 gets nodes 2 and 3) and then run:
mpirun -n 2 ./runme -p 2 -q 1 -b 384 -n 40000
B. mpirun -env HPL_HOST_NODE=0,1 -np 1 ./xhpl_intel64_dynamic : -env HPL_HOST_NODE=2,3 -np 1 ./xhpl_intel64_dynamic (with P=2, Q=1 in HPL.dat)
C. mpirun -np 1 ./xhpl_intel64_dynamic -p 1 -q 1 -b 384 -n 40000
Cases A and B should be similar, matching Holger's tests. Case C should give almost the same result as A and B, and it also works fine if you have more nodes in the system with 1 MPI rank per node.
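For reference, a minimal sketch of where the two variables above are set, assuming the runme_intel64_dynamic script shipped in the mp_linpack directory (the exact contents may differ by MKL version):
export MPI_PROC_NUM=1   # total MPI ranks = P x Q = number of physical servers
export MPI_PER_NODE=1   # keep at 1: one rank per node, regardless of socket count
The shipped script passes these values to mpirun, so editing just these two lines should adapt it to your cluster size.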
Best Regards,
Ying
Hi, what is the formula to calculate N? I searched the documents but didn't find the right formula to calculate N, i.e. the problem size, based on the available memory of the host.
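The thread does not answer this, but a widely used rule of thumb (an assumption, not from this thread) is to size the N x N double-precision matrix to roughly 80% of total RAM, i.e. N ≈ sqrt(0.8 * memory_in_bytes / 8), rounded down to a multiple of NB. A minimal sketch in shell, using bc for the square root:
MEM_GIB=64                                     # assumption: total host RAM in GiB
NB=384                                         # block size from HPL.dat
MEM_BYTES=$((MEM_GIB * 1024 * 1024 * 1024))
N=$(echo "sqrt(0.8 * $MEM_BYTES / 8)" | bc)    # bytes -> doubles -> matrix edge
N=$(( (N / NB) * NB ))                         # round down to a multiple of NB
echo $N                                        # prints 82560 for 64 GiB
Larger N generally gives higher Gflops, but exceeding physical memory makes the run swap and the result meaningless.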
