I am seeking help trying to recreate MKL HPL results from the following site running the HPL binaries supplied with ICC V16.
First, I have a very simple question that I have been unable to find a specific answer to. If I am running on a single node, single socket, 22 core E5-2699v4 with HyperThreading disabled, what should the values of P and Q be in HPL.dat?
HPL.dat generator sites such as http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/ suggest PxQ should equal the number of cores per node, which would be 22, but since MKL HPL will itself spawn threads to the max 22 cores, should P and Q both equal 1?
On a side note, is there any way to prevent MKL from spawning threads to complete the work? I tried OMP_NUM_THREADS=1 and KMP_NUM_THREADS=1, as well as other env variables suggested in the docs, but none seem to override the default behavior.
First, you could get the latest Intel Optimized MP LINPACK (MKL HPL) binaries from here: https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite. This site is regularly updated with the latest MKL benchmarks optimized for Intel processors.
For a single socket system, you could use P=Q=1. MKL HPL spawns multiple threads to exploit multi/many-cores, and it does not require running 1 MPI process per core. Our recommendation is to run 1 MPI process per socket to get the best performance.
Currently MKL HPL does not react to OMP_NUM_THREADS or any other OpenMP environment variables. Although this is not recommended for performance, if you really would like to run 1 MPI process per core, you could use HPL_HOST_CORE environment variable: https://software.intel.com/en-us/node/528634. Here, we need to specify the core id for each MPI process. For example, if you have 22 MPI processes, we need to set HPL_HOST_CORE=0 for the first MPI process, HPL_HOST_CORE=1 for the second MPI process, and so on. With this configuration, you could choose P and Q values such that PxQ=22.
Thank you, Efe.
Thanks for the response. The results are now in line w/ expectations on my single socket system when using P=Q=1 w/ the MKL HPL binary.
You stated "Currently MKL HPL does react to OMP_NUM_THREADS or any other OpenMP environment variables." Did you mean to say "does not"?
I will need to research how to set HPL_HOST_CORE for each MPI process - not clear to me how to do so.
Also, are the compile options used for the MKL HPL binary published anywhere?
Yes, I meant that MKL HPL does not react to OpenMP environment variables. Sorry about the typo. I fixed it above.
If you would like to run 1 MPI per core, you could put the commands below into a ./runme script:
and run it as follows:
mpirun -np 22 ./runme
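A minimal sketch of what such a ./runme wrapper might look like, assuming the MPI launcher exports PMI_RANK to each process (the variable mentioned later in this thread) and that the benchmark binary is named xhpl_intel64_dynamic. Both names are assumptions rather than something stated in the post, so adjust them for your MPI and MKL package.

```shell
#!/bin/sh
# runme: per-rank wrapper, started once per MPI process by
#   mpirun -np 22 ./runme
# Assumption: the launcher exports PMI_RANK (0, 1, 2, ...) to each process.
# Assumption: XHPL points at the MKL HPL binary; the default name is a guess.
XHPL="${XHPL:-./xhpl_intel64_dynamic}"

# Bind this MPI process to the core whose id equals its rank:
# rank 0 -> core 0, rank 1 -> core 1, and so on.
export HPL_HOST_CORE="${PMI_RANK:-0}"

if [ -x "$XHPL" ]; then
    # Replace the wrapper with the benchmark so mpirun sees its exit status.
    exec "$XHPL" "$@"
else
    # Report the problem; the benchmark is simply not started.
    echo "runme: benchmark binary not found at $XHPL" >&2
fi
```

The script needs to be executable (chmod +x runme) and sit next to HPL.dat, with PxQ in HPL.dat matching the number of ranks passed to mpirun.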
Compiler flags for MP LINPACK are: -fPIC -z noexecstack -z relro -pie -z now -O3. I do not think they are published currently, but they probably should be.
Thank you for the information.
In your previous post you stated "MKL HPL spawns multiple threads to exploit multi/many-cores". I am familiar with the auto-par compiler option that allows this to occur for optimal SPEC_speed results, but I do not see that option specified in the flags you note for MP LINPACK. Is there a specific option that allows this for MP LINPACK? If not, how does it get done?
It does appear that P=Q=1 for my single node, single socket 22 core E5-2699v4 with HT disabled and Turbo disabled, running 1 MPI process, gives the best results at ~700 GFLOP/sec.
We do not rely on compiler options for auto parallelization. We modified the NETLIB HPL code to explicitly use OS threads for parallelism in the Intel Optimized MP LINPACK benchmark.
700 GFLOP/sec is within our expected performance range for the 22 core E5-2699v4 system.
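As a rough sanity check of that figure, here is a sketch that compares it with a nominal peak. It assumes 16 double-precision FLOPs per cycle per core (two AVX2 FMA units on 4-wide vectors) and the 2.2GHz Turbo Off frequency quoted later in this thread. The clock actually sustained under AVX load can differ from the nominal one, so the efficiency is only approximate.

```shell
#!/bin/sh
# Nominal double-precision peak for one socket and HPL efficiency against it.
cores=22              # E5-2699v4 cores, HT disabled
ghz=2.2               # nominal frequency with Turbo off
flops_per_cycle=16    # assumption: 2 FMA units x 4 doubles x 2 FLOPs (AVX2)
measured=700          # GFLOP/sec reported in this thread

peak=$(awk -v c="$cores" -v f="$ghz" -v w="$flops_per_cycle" \
    'BEGIN { printf "%.1f", c * f * w }')
efficiency=$(awk -v m="$measured" -v p="$peak" \
    'BEGIN { printf "%.1f", 100 * m / p }')

echo "nominal peak: $peak GFLOP/sec"
echo "HPL efficiency: $efficiency %"
```

With these inputs the nominal peak comes out at 774.4 GFLOP/sec, so ~700 GFLOP/sec corresponds to roughly 90% of it.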
Thanks for the reply.
Your suggestion for 1 MPI per core worked, though I have not had the chance to research the use of PMI_RANK. GFLOP results for any 1 MPI per core runs were the same no matter what core number was used, which I suspect makes sense since there is no parallelization in the 1 MPI per core case.
Interestingly, the Turbo On and Turbo Off results were the same with 1 MPI per socket w/ all 22 cores enabled and HT disabled. This shows that thermal conditions probably prevented any boost in frequency w/ Turbo on. PCM data confirmed this: power was at 145 watts, the spec'd TDP, and AFREQ was almost 1 in both the Turbo On and Turbo Off runs. With just a single core enabled, Turbo On results were 60% better than Turbo Off, which is in line w/ expectations since the 2699v4's max single-core Turbo On frequency (3.6GHz) is 64% higher than its max Turbo Off frequency (2.2GHz).
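The single-core comparison can be checked with a line of arithmetic. A small sketch using only the two frequencies quoted above:

```shell
#!/bin/sh
# Frequency uplift from Turbo for a single active core on the E5-2699v4.
turbo_ghz=3.6   # max single-core Turbo On frequency quoted above
base_ghz=2.2    # max Turbo Off frequency quoted above

uplift=$(awk -v t="$turbo_ghz" -v b="$base_ghz" \
    'BEGIN { printf "%.0f", 100 * (t / b - 1) }')

echo "frequency uplift: $uplift %"
```

The uplift works out to 64%, so a measured 60% gain is slightly below what the clock ratio alone would give.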
Since this benchmark is called "Intel Optimized MP LINPACK benchmark" and the NETLIB HPL code was modified to allow parallelism, is this still considered the HPL benchmark? Seems like the modifications performed do not allow for fair comparison to binaries created with the unmodified HPL source code, or am I missing something?
Another follow-up question.
If I run 1 MPI per core with 22 cores enabled, "mpirun -np 22 ./runme", should the output results be multiplied by 22?
If yes, shouldn't this result be similar to the 1 MPI per socket results that parallelize the run across all 22 cores?
We do not need to multiply the output results. Are you modifying HPL.dat for 1 MPI per core runs? We need to set PxQ=22 for 22 MPI processes.
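To see which grids are possible for 22 processes, here is a small sketch that lists every P x Q factorization with P <= Q. Preferring the most nearly square grid is general NETLIB HPL tuning advice rather than something stated in this thread, and the function name is made up for illustration.

```shell
#!/bin/sh
# List all process grids P x Q with P <= Q and P * Q = number of MPI ranks.
# grid_candidates is a hypothetical helper name, not part of HPL.
# The body runs in a subshell so its variables do not leak to the caller.
grid_candidates() (
    n=$1
    p=1
    while [ $((p * p)) -le "$n" ]; do
        if [ $((n % p)) -eq 0 ]; then
            echo "P=$p Q=$((n / p))"
        fi
        p=$((p + 1))
    done
)

grid_candidates 22
```

For 22 ranks this prints P=1 Q=22 and P=2 Q=11; the 2 x 11 grid is the more nearly square of the two.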
1 MPI per socket is the recommended configuration for the best performance. Therefore, I would expect that the 1 MPI per core run will be slower than the 1 MPI per socket run.
NETLIB HPL is the reference implementation. For Top500 submissions, modifying the source code is allowed with certain restrictions; for example, the matrix multiplication must use the standard triple-nested loop algorithm (no Strassen).
If you are not allowed to change the source code of the reference NETLIB HPL implementation, you could link NETLIB HPL with Intel MKL. You can use the Make.Linux_Intel64 makefile to link NETLIB HPL with the Intel MKL library.
Thank you, Efe.
Yes, I missed the fact that PxQ=22 for the MPI per core run. It was about 10% worse than the P=Q=1 MPI per socket run.
That raises the question of what the metric should be when reporting a score. MPI per core would seem more appropriate, but I can find no additional information. MPI per core, run the way you suggested, would also eliminate the benefit seen from parallelizing the NETLIB HPL code in the P=Q=1 MPI per socket case.
When people report results, what should be assumed - MPI per core?
Please forgive the ignorance here as I am very thankful for your help.
People report the overall performance of the benchmark for the whole system. This is the GFlop/sec number printed at the very end of the benchmark run. You can then calculate what is the performance per core, but this is not a commonly used metric.
Why would you like to eliminate the benefit from parallelizing the NETLIB HPL code? We multithread the NETLIB HPL code to improve performance. This way we move some of the MPI-level parallelism to multithreading, which is typically more efficient on multicore systems.
May I ask some questions about the HPL benchmark from Intel?
Performance as of today is worse than compiling the NETLIB HPL code and launching it with mpirun. I was expecting to find better results on Xeon Gold 6248 using the Intel HPL benchmark, but the results are 25% worse than NETLIB HPL compiled with Parallel Studio XE 2020.1.
It took me a while to understand that the HPL benchmark from Intel relies on threads. This is not documented at all, and I was only able to find it out from this thread.
So what's the status today? Does HPL from Intel still give the best results? Because I can't see it.