Running Linpack MKL (xhpl.2018.3.222.static) with MKL_ENABLE_INSTRUCTIONS=SSE4_2 on Skylake with turbo enabled.
I've tried this with three different releases of MKL and three different Skylake processors. They all show the same effect, but with different frequencies, of course.
The base thread of each of the MPI ranks runs at the AVX512 turbo frequency, while the other threads run at the expected non-AVX frequency.
If I specify AVX2, all threads run at the AVX 2.0 frequency, as expected.
If I specify AVX512, all threads run at the AVX 512 frequency, as expected.
At first I thought the SSE 4.2 run might be using 512-bit instructions on those two CPUs, but fiddling with the performance MSRs to look at the counters shows that only the expected double-precision floating-point instructions are being retired.
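(An alternative to poking the MSRs by hand, assuming a reasonably recent perf that exposes the Skylake FP_ARITH_INST_RETIRED events, is something along these lines; for an SSE 4.2 run only the scalar/128-bit counters should advance:)

# Count retired double-precision FP instructions by vector width, system-wide, for 10 seconds
perf stat -a \
    -e fp_arith_inst_retired.scalar_double \
    -e fp_arith_inst_retired.128b_packed_double \
    -e fp_arith_inst_retired.256b_packed_double \
    -e fp_arith_inst_retired.512b_packed_double \
    -- sleep 10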
Here are some characteristics of my Skylake processor and the Linpack run (frequencies are all-cores-active max frequencies, in GHz):
# cores/processor: 8

                 frequency   GFlops        run time (sec)
non-AVX turbo    4.1         2.07505e+02   222.87
AVX 2.0 turbo    3.7         8.22624e+02    56.22
AVX 512 turbo    3.0         1.30613e+03    35.41
Below is a turbostat snapshot while running with SSE4_2
(There's a bit of bouncing around of frequencies as the job runs, but you can see that the CPU 0 & 8 frequencies are low, tending toward 3.0 GHz, while the other 14 CPUs' frequencies are high, tending toward 4.1 GHz.)
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp PkgWatt RAMWatt PKG_% RAM_%
-    -   3957 100.00 3967 3891 15914 0 0.00 0.00 0.00 0.00 69 69 317.50 0.00 0.00 0.00
4    1   4090 100.00 4100 3891 5011 0 0.00 0.00 0.00 0.00 54
8    2   4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 66
9    3   4090 100.00 4100 3891 84 0 0.00 0.00 0.00 0.00 67
11   4   4090 100.00 4100 3891 8 0 0.00 0.00 0.00 0.00 63
16   0   3047 100.00 3054 3891 5626 0 0.00 0.00 0.00 0.00 55 67 153.59 0.00 0.00 0.00
18   5   4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 67
19   6   4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 64
25   7   4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 63
1    8   3006 100.00 3013 3891 5080 0 0.00 0.00 0.00 0.00 50 69 163.91 0.00 0.00 0.00
2    9   4090 100.00 4100 3891 10 0 0.00 0.00 0.00 0.00 56
3    10  4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 66
4    11  4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 67
8    12  4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 67
18   13  4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 69
24   14  4090 100.00 4100 3891 8 0 0.00 0.00 0.00 0.00 69
27   15  4090 100.00 4100 3891 15 0 0.00 0.00 0.00 0.00 67
I used the attached script to reproduce this. It takes an optional argument for the desired setting for MKL_ENABLE_INSTRUCTIONS, defaulting to SSE4_2. It will create an HPL.dat file if it does not exist, and run Linpack with two MPI ranks.
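(For anyone who can't grab the attachment, a minimal sketch of what such a wrapper does is below; the binary name and HPL.dat generation are placeholders, not the actual attached script.)

#!/bin/bash
# Sketch only: pick the MKL code path from the first argument and run HPL on two ranks.
ISA=${1:-SSE4_2}                      # optional argument, defaults to SSE4_2
export MKL_ENABLE_INSTRUCTIONS=${ISA}
if [ ! -f HPL.dat ]; then
    : # the attached script writes a default HPL.dat here (contents omitted)
fi
mpirun -np 2 ./xhpl                   # two MPI ranks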
-- Chuck Newman
This effect is more prominent when using only one processor, especially on parts with many cores.
An easy way to do that on a 2P server with HT disabled is to replace the setting of the "Cores" array in the script with these two lines:
NumCores=$(grep -c processor /proc/cpuinfo)
Cores=(0-$((${NumCores}/4-1)) $((${NumCores}/4))-$((${NumCores}/2-1)))
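For example, on a 2P server with 32 cores total (HT disabled) this yields Cores=(0-7 8-15); assuming the first 16 CPUs are on the first socket, both MPI ranks then land on that one processor.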
Also, I should have augmented the PATH environment variable rather than overwriting it, so use this instead:
export PATH=${I_MPI_ROOT}/intel64/bin:${PATH}
Hi,
The HPL main thread performs MPI operations and has to wait until each transaction completes. Until the data arrives, the core is suspended, which prevents it from reaching a higher frequency. In particular, you're running with P=1, Q=2, so only one of the MPI ranks performs the panel factorization while the other rank just waits for the broadcast data.
If you try P=2 and Q=1, the main core's frequency will be slightly higher due to better load balancing.
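(For reference, the process grid is set by the Ps and Qs lines in HPL.dat; the exact column layout is illustrative, but the change amounts to:)

2            Ps
1            Qs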
I'm running with power saving disabled and booted with idle=poll, so all cores run at max frequency unless something causes them to slow down. Here's what turbostat shows on the server when it is idle -- Bzy_MHz is 4100 (4.1 GHz):
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp PkgWatt RAMWatt PKG_% RAM_%
-    -   4090 100.00 4100 3891 16039 0 0.00 0.00 0.00 0.00 42 42 189.52 0.00 0.00 0.00
4    1   4090 100.00 4100 3891 5008 0 0.00 0.00 0.00 0.00 39
8    2   4090 100.00 4100 3891 6 0 0.00 0.00 0.00 0.00 42
9    3   4090 100.00 4100 3891 81 0 0.00 0.00 0.00 0.00 42
11   4   4090 100.00 4100 3891 6 0 0.00 0.00 0.00 0.00 40
16   0   4090 100.00 4100 3891 5603 0 0.00 0.00 0.00 0.00 42 42 95.16 0.00 0.00 0.00
18   5   4090 100.00 4100 3891 6 0 0.00 0.00 0.00 0.00 42
19   6   4090 100.00 4100 3891 6 0 0.00 0.00 0.00 0.00 40
25   7   4090 100.00 4100 3891 6 0 0.00 0.00 0.00 0.00 40
1    8   4090 100.00 4100 3891 5006 0 0.00 0.00 0.00 0.00 38 41 94.36 0.00 0.00 0.00
2    9   4090 100.00 4100 3891 6 0 0.00 0.00 0.00 0.00 39
3    10  4090 100.00 4100 3891 7 0 0.00 0.00 0.00 0.00 39
4    11  4090 100.00 4100 3891 7 0 0.00 0.00 0.00 0.00 39
8    12  4090 100.00 4100 3891 6 0 0.00 0.00 0.00 0.00 40
18   13  4090 100.00 4100 3891 6 0 0.00 0.00 0.00 0.00 40
24   14  4090 100.00 4100 3891 6 0 0.00 0.00 0.00 0.00 41
27   15  4090 100.00 4100 3891 273 0 0.00 0.00 0.00 0.00 39
Nevertheless, I set P=2 and Q=1 as you suggested, and I still see the two threads running at 3.0 GHz.
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp PkgWatt RAMWatt PKG_% RAM_%
-    -   3960 100.00 3969 3891 15959 0 0.00 0.00 0.00 0.00 71 71 320.13 0.00 0.00 0.00
4    1   4090 100.00 4100 3891 5007 0 0.00 0.00 0.00 0.00 56
8    2   4090 100.00 4100 3891 12 0 0.00 0.00 0.00 0.00 67
9    3   4090 100.00 4100 3891 89 0 0.00 0.00 0.00 0.00 69
11   4   4090 100.00 4100 3891 12 0 0.00 0.00 0.00 0.00 64
16   0   3016 100.00 3023 3891 5693 0 0.00 0.00 0.00 0.00 56 69 155.15 0.00 0.00 0.00
18   5   4090 100.00 4100 3891 12 0 0.00 0.00 0.00 0.00 69
19   6   4090 100.00 4100 3891 15 0 0.00 0.00 0.00 0.00 66
25   7   4090 100.00 4100 3891 15 0 0.00 0.00 0.00 0.00 65
1    8   3076 100.00 3083 3891 5010 0 0.00 0.00 0.00 0.00 52 71 164.97 0.00 0.00 0.00
2    9   4090 100.00 4100 3891 12 0 0.00 0.00 0.00 0.00 58
3    10  4090 100.00 4100 3891 12 0 0.00 0.00 0.00 0.00 68
4    11  4090 100.00 4100 3891 13 0 0.00 0.00 0.00 0.00 68
8    12  4090 100.00 4100 3891 12 0 0.00 0.00 0.00 0.00 68
18   13  4090 100.00 4100 3891 13 0 0.00 0.00 0.00 0.00 71
24   14  4090 100.00 4100 3891 13 0 0.00 0.00 0.00 0.00 70
27   15  4090 100.00 4100 3891 19 0 0.00 0.00 0.00 0.00 68
If power saving were coming into play, it would seem very coincidental that those two cores settle at the max-all-core AVX512 turbo frequency on the three different processors I have tried this on.
I removed the "intel_pstate=disable idle=poll" boot options, and my idle server now looks like this (Avg_MHz is now very low):
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp PkgWatt RAMWatt PKG_% RAM_%
-    -   7  0.18 4099 3891 26391 0 99.82 0.00 0.00 0.00 31 31 127.66 0.00 0.00 0.00
0    0   31 0.76 4099 3891 10718 0 99.24 0.00 0.00 0.00 31 31 64.80 0.00 0.00 0.00
2    1   41 1.01 4099 3891 10084 0 98.99 0.00 0.00 0.00 28
3    2   0  0.00 4103 3891 3 0 100.00 0.00 0.00 0.00 29
9    3   0  0.01 4098 3891 178 0 99.99 0.00 0.00 0.00 30
16   4   0  0.00 4100 3891 3 0 100.00 0.00 0.00 0.00 28
19   5   0  0.00 4100 3891 3 0 100.00 0.00 0.00 0.00 30
26   6   0  0.00 4099 3891 3 0 100.00 0.00 0.00 0.00 27
27   7   0  0.00 4099 3891 3 0 100.00 0.00 0.00 0.00 28
0    8   41 1.00 4100 3891 5004 0 99.00 0.00 0.00 0.00 28 29 62.86 0.00 0.00 0.00
4    9   0  0.00 4102 3891 3 0 100.00 0.00 0.00 0.00 26
5    10  0  0.00 4101 3891 3 0 100.00 0.00 0.00 0.00 28
6    11  0  0.00 4100 3891 3 0 100.00 0.00 0.00 0.00 29
16   12  0  0.00 4100 3891 3 0 100.00 0.00 0.00 0.00 26
19   13  0  0.00 4101 3891 3 0 100.00 0.00 0.00 0.00 29
20   14  0  0.00 4102 3891 3 0 100.00 0.00 0.00 0.00 29
22   15  1  0.03 4100 3891 374 0 99.97 0.00 0.00 0.00 30
When Linpack is running, however, turbostat shows roughly the same as before:
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp PkgWatt RAMWatt PKG_% RAM_%
-    -   3955 100.00 3965 3891 16083 0 0.00 0.00 0.00 0.00 72 72 310.59 0.00 0.00 0.00
0    0   3009 100.00 3016 3891 5675 0 0.00 0.00 0.00 0.00 50 65 152.04 0.00 0.00 0.00
2    1   4090 100.00 4100 3891 5007 0 0.00 0.00 0.00 0.00 59
3    2   4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 65
9    3   4090 100.00 4100 3891 279 0 0.00 0.00 0.00 0.00 62
16   4   4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 58
19   5   4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 65
26   6   4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 62
27   7   4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 63
0    8   3009 100.00 3016 3891 5003 0 0.00 0.00 0.00 0.00 47 72 158.55 0.00 0.00 0.00
4    9   4090 100.00 4100 3891 12 0 0.00 0.00 0.00 0.00 53
5    10  4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 68
6    11  4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 72
16   12  4090 100.00 4100 3891 9 0 0.00 0.00 0.00 0.00 60
19   13  4090 100.00 4100 3891 12 0 0.00 0.00 0.00 0.00 66
20   14  4090 100.00 4100 3891 11 0 0.00 0.00 0.00 0.00 71
22   15  4090 100.00 4100 3891 12 0 0.00 0.00 0.00 0.00 69
Remember also that this curious behavior happens on Skylake only with SSE 4.2 code, not with AVX2 or AVX512.
I tried it on a Broadwell server (E5-2690 v4) and all CPUs ran at the same frequency for SSE 4.2. I also tried it with AVX2 and saw similar, expected behavior.
The max non-AVX turbo frequency is 3.2 GHz.
All CPUs of the first processor run at ~3.1 GHz, and all CPUs of the second processor at ~3.2 GHz;
there is no trace here of the first thread of each MPI process running at a different speed.
(With AVX2 they all run at ~2.7 GHz; the max AVX2 turbo frequency is 2.8 GHz.)
pk cor CPU %c0   GHz  TSC  %c1  %c3  %c6  %pc3 %pc6
           99.76 3.13 2.60 0.24 0.00 0.00 0.00 0.00
0  0   0   99.66 3.07 2.60 0.34 0.00 0.00 0.00 0.00
0  1   14  99.65 3.07 2.60 0.35 0.00 0.00 0.00 0.00
0  2   1   99.65 3.07 2.60 0.35 0.00 0.00 0.00 0.00
0  3   15  99.66 3.07 2.60 0.34 0.00 0.00 0.00 0.00
0  4   2   99.66 3.07 2.60 0.34 0.00 0.00 0.00 0.00
0  5   16  99.66 3.07 2.60 0.34 0.00 0.00 0.00 0.00
0  6   3   99.65 3.07 2.60 0.35 0.00 0.00 0.00 0.00
0  8   17  99.66 3.07 2.60 0.34 0.00 0.00 0.00 0.00
0  9   4   99.65 3.07 2.60 0.35 0.00 0.00 0.00 0.00
0  10  18  99.65 3.07 2.60 0.35 0.00 0.00 0.00 0.00
0  11  5   99.65 3.07 2.60 0.35 0.00 0.00 0.00 0.00
0  12  19  99.65 3.07 2.60 0.35 0.00 0.00 0.00 0.00
0  13  6   99.65 3.07 2.60 0.35 0.00 0.00 0.00 0.00
0  14  20  99.65 3.07 2.60 0.35 0.00 0.00 0.00 0.00
1  0   7   99.87 3.19 2.60 0.13 0.00 0.00 0.00 0.00
1  1   21  99.87 3.19 2.60 0.13 0.00 0.00 0.00 0.00
1  2   8   99.86 3.19 2.60 0.14 0.00 0.00 0.00 0.00
1  3   22  99.87 3.19 2.60 0.13 0.00 0.00 0.00 0.00
1  4   9   99.87 3.19 2.60 0.13 0.00 0.00 0.00 0.00
1  5   23  99.87 3.19 2.60 0.13 0.00 0.00 0.00 0.00
1  6   10  99.87 3.19 2.60 0.13 0.00 0.00 0.00 0.00
1  8   24  99.87 3.19 2.60 0.13 0.00 0.00 0.00 0.00
1  9   11  99.86 3.19 2.60 0.14 0.00 0.00 0.00 0.00
1  10  25  99.86 3.18 2.60 0.14 0.00 0.00 0.00 0.00
1  11  12  99.86 3.18 2.60 0.14 0.00 0.00 0.00 0.00
1  12  26  99.86 3.18 2.60 0.14 0.00 0.00 0.00 0.00
1  13  13  99.86 3.18 2.60 0.14 0.00 0.00 0.00 0.00
1  14  27  99.86 3.18 2.60 0.14 0.00 0.00 0.00 0.00
Hi,
Can you try "export I_MPI_PLATFORM=wsm" (or nhm)? The MKL_ENABLE_INSTRUCTIONS environment variable does not control MPI behavior.
Thanks for your suggestion. I tried each of these but there was no change; my CPUs 0 and 8 still ran at 3.0 GHz while the other CPUs ran at 4.1 GHz.
export I_MPI_PLATFORM=wsm
export I_MPI_PLATFORM=nhm
export I_MPI_PLATFORM=snb
export I_MPI_PLATFORM=ivb
export I_MPI_PLATFORM=htn
I presume wsm and nhm stand for "Westmere" and "Nehalem" respectively; I also found snb, ivb, and htn. Is there a similar value for Haswell, Broadwell, or Skylake?
Output included the following, which confuses me somewhat:
Intel(R) MPI Library, Version 2017 Build 20160721 (id: 15987)
Copyright (C) 2003-2016 Intel Corporation. All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 21756 hermesn72 0
[0] MPI startup(): 1 21757 hermesn72 8
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx5_0:0,mlx5_1:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 8
[0] MPI startup(): I_MPI_PLATFORM=ivb
[0] MPI startup(): I_MPI_PRINT_VERSION=yes
I'm using Intel MPI 2018.3.222, so why is it identifying itself as "Version 2017 Build 20160721"?
"ps" shows that the only process running mpirun is running this executable:
/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi//intel64/bin/mpirun
It identifies itself as follows:
# /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi//intel64/bin/mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.
Okay; this is interesting. I've been using the static version of the executable.
I tried with the dynamic version, and the problem goes away. With SSE 4.2 I now see all threads running at 4.1 GHz, as expected.
And, with AVX2 and AVX512 I still see most threads at the corresponding max turbo frequency of 3.7 GHz and 3.0 GHz, respectively, but the first two threads of each MPI rank float above that by a couple hundred MHz, which would be expected if they run non-AVX code for any length of time.
Furthermore, performance increased for SSE 4.2:
SSE4_2: performance increased from 2.07505e+02 GFlops to 2.45024e+02 GFlops.
AVX2: decreased a little, from 8.22624e+02 GFlops to 8.17991e+02 GFlops.
AVX512: didn't change much, dropping from 1.30613e+03 GFlops to 1.30442e+03 GFlops.
It appears that the culprit is Intel MPI Version 2017 Build 20160721 and not MKL.
Since it's an old version of Intel MPI, it's probably not worth pursuing.
However, let me recommend that for the next release of MKL, the Linpack binaries be built with the current version of Intel MPI.
Here are the details:
Looking at the readme file at /opt/intel/compilers_and_libraries_2018.3.222/linux/mkl/benchmarks/mp_linpack/readme.txt, I see that the difference between the static and dynamic versions is how they link Intel MPI, not how they were built with MKL:
runme_intel64_dynamic : Run script for dynamically linked MPI
runme_intel64_static  : Run script for statically linked MPI
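(For reference, a run of the dynamic script along these lines, using the environment variables already discussed, shows which MPI it picks up; the exact invocation may vary:)

cd /opt/intel/compilers_and_libraries_2018.3.222/linux/mkl/benchmarks/mp_linpack
export MKL_ENABLE_INSTRUCTIONS=SSE4_2
export I_MPI_DEBUG=5
export I_MPI_PRINT_VERSION=yes
./runme_intel64_dynamic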
When I run the runme_intel64_dynamic binary with I_MPI_PRINT_VERSION=yes, I see the following in the output -- it's using the current version of Intel MPI instead of an old version that was built in 2016:
Intel(R) MPI Library, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright (C) 2003-2018 Intel Corporation. All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 23838 hermesn72 0
[0] MPI startup(): 1 23839 hermesn72 8
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx5_0:0,mlx5_1:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 8
[0] MPI startup(): I_MPI_PRINT_VERSION=yes
And, as I said, the threads all run at the expected frequency with SSE4_2, and the performance is notably better.
A bit more investigation; using the dynamic version, I tried with several different versions of Intel MPI that we have in our lab, with the following results:
2016.3.210 : Good
2017.0.098 : Bad  <- This is the version that was used for the static version of Linpack in MKL 2018 Update 3
2017.1.132 : Bad
2017.4.196 : Good
2018.0.128 : Good
2018.1.163 : Good
2018.2.199 : Good
2018.3.222 : Good
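(For completeness, a sketch of how those per-version runs can be driven, assuming each Intel MPI version is installed under /opt/intel and provides the usual mpivars.sh:)

for v in 2016.3.210 2017.0.098 2017.1.132 2017.4.196 2018.0.128 2018.1.163 2018.2.199 2018.3.222; do
    (
        # pick up mpirun and libmpi from this particular MPI install in a subshell
        source /opt/intel/compilers_and_libraries_${v}/linux/mpi/intel64/bin/mpivars.sh
        ./runme_intel64_dynamic
    )
done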
Thank you very much for your findings. We're going to check whether we can use the latest version of MPI.
I still see a problem under certain very specific conditions. I have a suspicion that I know what's happening, and I suspect it's in MPI. I've posted the details in the Intel Clusters and HPC Technology forum. See https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/783492.
