Newman__Chuck
Novice
410 Views

MPI Linpack from MKL, SSE4_2, turbo, and Skylake: Some SSE 4.2 threads run at the AVX512 turbo frequency

I'm running MKL's Linpack binary (xhpl.2018.3.222.static) with MKL_ENABLE_INSTRUCTIONS=SSE4_2 on Skylake with turbo enabled.

I've tried this with three different releases of MKL and three different Skylake processors.  They all show the same effect, but with different frequencies, of course.

The base thread of each of the MPI ranks runs at the AVX512 turbo frequency, while the other threads run at the expected non-AVX frequency.
If I specify AVX2, all threads run at the AVX 2.0 frequency, as expected.
If I specify AVX512, all threads run at the AVX 512 frequency, as expected.

At first I thought the SSE 4.2 run might be using 512-bit instructions on those two CPUs, but fiddling with the performance MSRs to look at the counters shows that only the expected double-precision floating-point instructions are being retired.

Here are some characteristics of my Skylake processor and the Linpack run (frequencies are all-cores-active max frequencies, in GHz):

# cores/processor    8
                     frequency       GFlops          run time (sec)
non-AVX turbo        4.1             2.07505e+02     222.87
AVX 2.0 turbo        3.7             8.22624e+02      56.22 
AVX 512 turbo        3.0             1.30613e+03      35.41

Below is a turbostat snapshot while running with SSE4_2:

(There's a bit of bouncing around of frequencies as the job runs, but you can see that the CPU 0 & 8 frequencies are low, tending toward 3.0 GHz, and the other 14 CPUs' frequencies are high, tending toward 4.1 GHz.)

        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_%   RAM_%
        -       -       3957    100.00  3967    3891    15914   0       0.00    0.00    0.00    0.00    69      69      317.50  0.00    0.00    0.00
        4       1       4090    100.00  4100    3891    5011    0       0.00    0.00    0.00    0.00    54
        8       2       4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    66
        9       3       4090    100.00  4100    3891    84      0       0.00    0.00    0.00    0.00    67
        11      4       4090    100.00  4100    3891    8       0       0.00    0.00    0.00    0.00    63
        16      0       3047    100.00  3054    3891    5626    0       0.00    0.00    0.00    0.00    55      67      153.59  0.00    0.00    0.00
        18      5       4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    67
        19      6       4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    64
        25      7       4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    63
        1       8       3006    100.00  3013    3891    5080    0       0.00    0.00    0.00    0.00    50      69      163.91  0.00    0.00    0.00
        2       9       4090    100.00  4100    3891    10      0       0.00    0.00    0.00    0.00    56
        3       10      4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    66
        4       11      4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    67
        8       12      4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    67
        18      13      4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    69
        24      14      4090    100.00  4100    3891    8       0       0.00    0.00    0.00    0.00    69
        27      15      4090    100.00  4100    3891    15      0       0.00    0.00    0.00    0.00    67

I used the attached script to reproduce this.  It takes an optional argument for the desired setting for MKL_ENABLE_INSTRUCTIONS, defaulting to SSE4_2.  It will create an HPL.dat file if it does not exist, and run Linpack with two MPI ranks.
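Since the attachment itself isn't shown here, below is a minimal sketch of the HPL.dat-creation step only. The layout is the standard HPL.dat input format; the tuning values (N, NB, and the rest) are generic placeholders, not the attached script's actual settings, though the Ps=1/Qs=2 grid matches the two-rank run described:

```shell
#!/bin/bash
# Sketch: create HPL.dat only if it does not already exist.
# N=40000 and NB=384 are placeholder values; tune them to the machine.
if [ ! -f HPL.dat ]; then
cat > HPL.dat <<'EOF'
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
40000        Ns
1            # of NBs
384          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
EOF
fi
```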

  -- Chuck Newman


11 Replies
Newman__Chuck
Novice

This effect is more prominent when using only one processor, especially on processors with many cores.
An easy way to arrange that on a 2P server with HT disabled is to replace the setting of the "Cores" array with these two lines:

NumCores=$(grep -c processor /proc/cpuinfo)                              # total logical CPUs
Cores=(0-$((${NumCores}/4-1)) $((${NumCores}/4))-$((${NumCores}/2-1)))   # two ranges, both on the first socket
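As a hypothetical worked example of that arithmetic, using a fixed count of 32 logical CPUs instead of reading /proc/cpuinfo (so a 2P, 16-cores-per-socket box with HT disabled):

```shell
# With 32 logical CPUs total, the two ranges split the first socket's
# 16 cores between the two MPI ranks, leaving the second socket idle.
NumCores=32
Cores=(0-$((${NumCores}/4-1)) $((${NumCores}/4))-$((${NumCores}/2-1)))
echo "${Cores[@]}"   # → 0-7 8-15
```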

Also, I should have augmented the PATH environment variable rather than overwriting it, so use this instead:

  export PATH=${I_MPI_ROOT}/intel64/bin:${PATH}

 

410 Views

Hi,

 The HPL main thread performs MPI operations and has to wait until each transaction completes. Until the data arrives, the core is suspended, which prevents it from reaching a higher frequency. In particular, you're using P=1, Q=2, so only one of the MPI ranks performs the panel factorization while the other rank just waits for the broadcast data.

 If you try P=2 and Q=1, the main core's frequency will be slightly higher due to better load balancing.

 

Newman__Chuck
Novice

I'm running with power savings disabled and booted with idle=poll, so all cores run at max frequency unless something causes them to slow down.  Here's what turbostat shows on the server when it is idle -- "Bzy_MHz" is 4.1 GHz:

        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_%   RAM_%
        -       -       4090    100.00  4100    3891    16039   0       0.00    0.00    0.00    0.00    42      42      189.52  0.00    0.00    0.00
        4       1       4090    100.00  4100    3891    5008    0       0.00    0.00    0.00    0.00    39
        8       2       4090    100.00  4100    3891    6       0       0.00    0.00    0.00    0.00    42
        9       3       4090    100.00  4100    3891    81      0       0.00    0.00    0.00    0.00    42
        11      4       4090    100.00  4100    3891    6       0       0.00    0.00    0.00    0.00    40
        16      0       4090    100.00  4100    3891    5603    0       0.00    0.00    0.00    0.00    42      42      95.16   0.00    0.00    0.00
        18      5       4090    100.00  4100    3891    6       0       0.00    0.00    0.00    0.00    42
        19      6       4090    100.00  4100    3891    6       0       0.00    0.00    0.00    0.00    40
        25      7       4090    100.00  4100    3891    6       0       0.00    0.00    0.00    0.00    40
        1       8       4090    100.00  4100    3891    5006    0       0.00    0.00    0.00    0.00    38      41      94.36   0.00    0.00    0.00
        2       9       4090    100.00  4100    3891    6       0       0.00    0.00    0.00    0.00    39
        3       10      4090    100.00  4100    3891    7       0       0.00    0.00    0.00    0.00    39
        4       11      4090    100.00  4100    3891    7       0       0.00    0.00    0.00    0.00    39
        8       12      4090    100.00  4100    3891    6       0       0.00    0.00    0.00    0.00    40
        18      13      4090    100.00  4100    3891    6       0       0.00    0.00    0.00    0.00    40
        24      14      4090    100.00  4100    3891    6       0       0.00    0.00    0.00    0.00    41
        27      15      4090    100.00  4100    3891    273     0       0.00    0.00    0.00    0.00    39

Nevertheless, I set P=2 and Q=1 as you suggested, and I still see the two threads running at 3.0 GHz.

        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_%   RAM_%
        -       -       3960    100.00  3969    3891    15959   0       0.00    0.00    0.00    0.00    71      71      320.13  0.00    0.00    0.00
        4       1       4090    100.00  4100    3891    5007    0       0.00    0.00    0.00    0.00    56
        8       2       4090    100.00  4100    3891    12      0       0.00    0.00    0.00    0.00    67
        9       3       4090    100.00  4100    3891    89      0       0.00    0.00    0.00    0.00    69
        11      4       4090    100.00  4100    3891    12      0       0.00    0.00    0.00    0.00    64
        16      0       3016    100.00  3023    3891    5693    0       0.00    0.00    0.00    0.00    56      69      155.15  0.00    0.00    0.00
        18      5       4090    100.00  4100    3891    12      0       0.00    0.00    0.00    0.00    69
        19      6       4090    100.00  4100    3891    15      0       0.00    0.00    0.00    0.00    66
        25      7       4090    100.00  4100    3891    15      0       0.00    0.00    0.00    0.00    65
        1       8       3076    100.00  3083    3891    5010    0       0.00    0.00    0.00    0.00    52      71      164.97  0.00    0.00    0.00
        2       9       4090    100.00  4100    3891    12      0       0.00    0.00    0.00    0.00    58
        3       10      4090    100.00  4100    3891    12      0       0.00    0.00    0.00    0.00    68
        4       11      4090    100.00  4100    3891    13      0       0.00    0.00    0.00    0.00    68
        8       12      4090    100.00  4100    3891    12      0       0.00    0.00    0.00    0.00    68
        18      13      4090    100.00  4100    3891    13      0       0.00    0.00    0.00    0.00    71
        24      14      4090    100.00  4100    3891    13      0       0.00    0.00    0.00    0.00    70
        27      15      4090    100.00  4100    3891    19      0       0.00    0.00    0.00    0.00    68

If power saving were coming into play, it would seem very coincidental that those two cores settle at the max-all-core AVX512 turbo frequency on all three of the processors I have tried this on.

 

I removed the "intel_pstate=disable idle=poll" boot options, and my idle server now looks like this (Avg_MHz is now very low):

        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_%   RAM_%
        -       -       7       0.18    4099    3891    26391   0       99.82   0.00    0.00    0.00    31      31      127.66  0.00    0.00    0.00
        0       0       31      0.76    4099    3891    10718   0       99.24   0.00    0.00    0.00    31      31      64.80   0.00    0.00    0.00
        2       1       41      1.01    4099    3891    10084   0       98.99   0.00    0.00    0.00    28
        3       2       0       0.00    4103    3891    3       0       100.00  0.00    0.00    0.00    29
        9       3       0       0.01    4098    3891    178     0       99.99   0.00    0.00    0.00    30
        16      4       0       0.00    4100    3891    3       0       100.00  0.00    0.00    0.00    28
        19      5       0       0.00    4100    3891    3       0       100.00  0.00    0.00    0.00    30
        26      6       0       0.00    4099    3891    3       0       100.00  0.00    0.00    0.00    27
        27      7       0       0.00    4099    3891    3       0       100.00  0.00    0.00    0.00    28
        0       8       41      1.00    4100    3891    5004    0       99.00   0.00    0.00    0.00    28      29      62.86   0.00    0.00    0.00
        4       9       0       0.00    4102    3891    3       0       100.00  0.00    0.00    0.00    26
        5       10      0       0.00    4101    3891    3       0       100.00  0.00    0.00    0.00    28
        6       11      0       0.00    4100    3891    3       0       100.00  0.00    0.00    0.00    29
        16      12      0       0.00    4100    3891    3       0       100.00  0.00    0.00    0.00    26
        19      13      0       0.00    4101    3891    3       0       100.00  0.00    0.00    0.00    29
        20      14      0       0.00    4102    3891    3       0       100.00  0.00    0.00    0.00    29
        22      15      1       0.03    4100    3891    374     0       99.97   0.00    0.00    0.00    30

When Linpack is running, however, turbostat shows roughly the same as before:

        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_%   RAM_%
        -       -       3955    100.00  3965    3891    16083   0       0.00    0.00    0.00    0.00    72      72      310.59  0.00    0.00    0.00
        0       0       3009    100.00  3016    3891    5675    0       0.00    0.00    0.00    0.00    50      65      152.04  0.00    0.00    0.00
        2       1       4090    100.00  4100    3891    5007    0       0.00    0.00    0.00    0.00    59
        3       2       4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    65
        9       3       4090    100.00  4100    3891    279     0       0.00    0.00    0.00    0.00    62
        16      4       4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    58
        19      5       4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    65
        26      6       4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    62
        27      7       4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    63
        0       8       3009    100.00  3016    3891    5003    0       0.00    0.00    0.00    0.00    47      72      158.55  0.00    0.00    0.00
        4       9       4090    100.00  4100    3891    12      0       0.00    0.00    0.00    0.00    53
        5       10      4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    68
        6       11      4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    72
        16      12      4090    100.00  4100    3891    9       0       0.00    0.00    0.00    0.00    60
        19      13      4090    100.00  4100    3891    12      0       0.00    0.00    0.00    0.00    66
        20      14      4090    100.00  4100    3891    11      0       0.00    0.00    0.00    0.00    71
        22      15      4090    100.00  4100    3891    12      0       0.00    0.00    0.00    0.00    69

 

Newman__Chuck
Novice

Remember also that this curious behavior happens on Skylake only with SSE 4.2 code, not with AVX2 or AVX512.

I tried it on a Broadwell server (E5-2690 v4), and all CPUs ran at the same frequency for SSE 4.2.  I also tried it with AVX2 and saw similar, expected behavior.
The max non-AVX turbo frequency is 3.2 GHz.

All CPUs of the first processor run at ~3.1 GHz, and those of the second processor at ~3.2 GHz;
no trace here of the first thread of each MPI process running at a different speed.
(With AVX2 they all run at ~2.7 GHz; the max AVX2 turbo frequency is 2.8 GHz.)

pk cor CPU    %c0  GHz  TSC    %c1    %c3    %c6   %pc3   %pc6
            99.76 3.13 2.60   0.24   0.00   0.00   0.00   0.00
 0   0   0  99.66 3.07 2.60   0.34   0.00   0.00   0.00   0.00
 0   1  14  99.65 3.07 2.60   0.35   0.00   0.00   0.00   0.00
 0   2   1  99.65 3.07 2.60   0.35   0.00   0.00   0.00   0.00
 0   3  15  99.66 3.07 2.60   0.34   0.00   0.00   0.00   0.00
 0   4   2  99.66 3.07 2.60   0.34   0.00   0.00   0.00   0.00
 0   5  16  99.66 3.07 2.60   0.34   0.00   0.00   0.00   0.00
 0   6   3  99.65 3.07 2.60   0.35   0.00   0.00   0.00   0.00
 0   8  17  99.66 3.07 2.60   0.34   0.00   0.00   0.00   0.00
 0   9   4  99.65 3.07 2.60   0.35   0.00   0.00   0.00   0.00
 0  10  18  99.65 3.07 2.60   0.35   0.00   0.00   0.00   0.00
 0  11   5  99.65 3.07 2.60   0.35   0.00   0.00   0.00   0.00
 0  12  19  99.65 3.07 2.60   0.35   0.00   0.00   0.00   0.00
 0  13   6  99.65 3.07 2.60   0.35   0.00   0.00   0.00   0.00
 0  14  20  99.65 3.07 2.60   0.35   0.00   0.00   0.00   0.00
 1   0   7  99.87 3.19 2.60   0.13   0.00   0.00   0.00   0.00
 1   1  21  99.87 3.19 2.60   0.13   0.00   0.00   0.00   0.00
 1   2   8  99.86 3.19 2.60   0.14   0.00   0.00   0.00   0.00
 1   3  22  99.87 3.19 2.60   0.13   0.00   0.00   0.00   0.00
 1   4   9  99.87 3.19 2.60   0.13   0.00   0.00   0.00   0.00
 1   5  23  99.87 3.19 2.60   0.13   0.00   0.00   0.00   0.00
 1   6  10  99.87 3.19 2.60   0.13   0.00   0.00   0.00   0.00
 1   8  24  99.87 3.19 2.60   0.13   0.00   0.00   0.00   0.00
 1   9  11  99.86 3.19 2.60   0.14   0.00   0.00   0.00   0.00
 1  10  25  99.86 3.18 2.60   0.14   0.00   0.00   0.00   0.00
 1  11  12  99.86 3.18 2.60   0.14   0.00   0.00   0.00   0.00
 1  12  26  99.86 3.18 2.60   0.14   0.00   0.00   0.00   0.00
 1  13  13  99.86 3.18 2.60   0.14   0.00   0.00   0.00   0.00
 1  14  27  99.86 3.18 2.60   0.14   0.00   0.00   0.00   0.00

 

410 Views

Hi,

 Can you try "export I_MPI_PLATFORM=wsm" (or nhm)?  The MKL_ENABLE_INSTRUCTIONS environment variable will not control MPI behavior.
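The two variables act on different layers; a simplified illustration (the comments are my gloss, not official documentation):

```shell
# MKL_ENABLE_INSTRUCTIONS caps the instruction set that MKL's compute
# kernels dispatch to; I_MPI_PLATFORM tells Intel MPI which platform
# to tune its own code paths for.  Setting one does not affect the other.
export MKL_ENABLE_INSTRUCTIONS=SSE4_2
export I_MPI_PLATFORM=wsm
```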

 

Newman__Chuck
Novice

Thanks for your suggestion.  I tried each of these but there was no change; my CPUs 0 and 8 still ran at 3.0 GHz while the other CPUs ran at 4.1 GHz.

  export I_MPI_PLATFORM=wsm
  export I_MPI_PLATFORM=nhm
  export I_MPI_PLATFORM=snb
  export I_MPI_PLATFORM=ivb
  export I_MPI_PLATFORM=htn
  

I presume wsm and nhm stand for "Westmere" and "Nehalem", respectively; I also found snb, ivb, and htn.  Is there a similar value for Haswell, Broadwell, or Skylake?
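For reference, later Intel MPI reference manuals document additional I_MPI_PLATFORM values along those lines; treat these as assumptions to verify against your installed release's documentation:

```shell
# Candidate values from later Intel MPI reference manuals (verify for
# your release; not all releases accept all of them):
export I_MPI_PLATFORM=hsw   # Haswell
export I_MPI_PLATFORM=bdw   # Broadwell
export I_MPI_PLATFORM=skx   # Skylake server
```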

Output included the following, which confuses me somewhat:

Intel(R) MPI Library, Version 2017  Build 20160721 (id: 15987)
Copyright (C) 2003-2016 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       21756    hermesn72  0
[0] MPI startup(): 1       21757    hermesn72  8
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx5_0:0,mlx5_1:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 8
[0] MPI startup(): I_MPI_PLATFORM=ivb
[0] MPI startup(): I_MPI_PRINT_VERSION=yes

I'm using Intel MPI 2018.3.222, so why is it identifying itself as "Version 2017  Build 20160721"?

"ps" shows that the only process running mpirun is running this executable:
/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi//intel64/bin/mpirun

It identifies itself as follows:

# /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi//intel64/bin/mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.

 

 

Newman__Chuck
Novice

Okay, this is interesting.  I've been using the static version of the executable.
I tried the dynamic version, and the problem goes away.  With SSE 4.2 I now see all threads running at 4.1 GHz, as expected.
And with AVX2 and AVX512 I still see most threads at the corresponding max turbo frequency of 3.7 GHz and 3.0 GHz, respectively, but the first thread of each of the two MPI ranks floats above that by a couple hundred MHz, which would be expected if it runs non-AVX code for any length of time.

Furthermore, performance increased for SSE 4.2:

SSE4_2: performance increased from 2.07505e+02 GFlops to 2.45024e+02 GFlops
AVX2: decreased a little, from 8.22624e+02 GFlops to 8.17991e+02 GFlops
AVX512: didn't change much, dropping from 1.30613e+03 GFlops to 1.30442e+03 GFlops
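Using the SSE4_2 numbers above, the relative improvement can be worked out in one line:

```shell
# Relative change for the SSE4_2 case: (new/old - 1) * 100
awk 'BEGIN { printf "%.1f%%\n", (245.024/207.505 - 1)*100 }'   # → 18.1%
```

So switching to the dynamic binary bought roughly an 18% GFlops gain in the SSE4_2 case, while the AVX2 and AVX512 cases moved by well under one percent.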

Newman__Chuck
Novice

It appears that the culprit is Intel MPI Version 2017 Build 20160721, not MKL.
Since it's an old version of Intel MPI, it's probably not worth pursuing.
However, let me recommend that for the next release of MKL, the Linpack binaries be built with the current version of Intel MPI.

Here are the details:

Looking at the readme file at /opt/intel/compilers_and_libraries_2018.3.222/linux/mkl/benchmarks/mp_linpack/readme.txt, I see that the difference between the static and dynamic versions is how they link Intel MPI, not how they link MKL:

runme_intel64_dynamic     : Run script for dynamically linked MPI
runme_intel64_static      : Run script for statically linked MPI

When I run the runme_intel64_dynamic script with I_MPI_PRINT_VERSION=yes, I see the following in the output -- it's using the current version of Intel MPI instead of an old version built in 2016:

Intel(R) MPI Library, Version 2018 Update 3  Build 20180411 (id: 18329)
Copyright (C) 2003-2018 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       23838    hermesn72  0
[0] MPI startup(): 1       23839    hermesn72  8
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx5_0:0,mlx5_1:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 8
[0] MPI startup(): I_MPI_PRINT_VERSION=yes

And, as I said, the threads all run at the expected frequency with SSE4_2, and the performance is notably better.

Newman__Chuck
Novice

A bit more investigation: using the dynamic version, I tried several different versions of Intel MPI that we have in our lab, with the following results:

2016.3.210      : Good
2017.0.098      : Bad  <- This is the version that was used for the static version of linpack in MKL 2018 Update 3
2017.1.132      : Bad
2017.4.196      : Good
2018.0.128      : Good
2018.1.163      : Good
2018.2.199      : Good
2018.3.222      : Good
410 Views

Thank you very much for your findings. We're going to check whether we can use the latest version of MPI.

 

Newman__Chuck
Novice

I still see a problem under certain very specific conditions.  I have a suspicion that I know what's happening, and I suspect it's in MPI.  I've posted the details in the Intel Clusters and HPC Technology forum.  See https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/783492.

 
