MPI Linpack from MKL, SSE4_2, turbo, and Skylake: SSE 4.2 threads run at the AVX2 turbo frequency

Newman__Chuck · ‎07-26-2018

This is a follow-on from this related topic in the MKL Forum
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/782951

In that situation, there were a couple of version of Intel MPI 2017.* in which a couple of the threads in each MPI process running Linpack with SSE 4.2 instructions would run at the AVX-512 turbo frequency rather than at the non-AVX frequency. (The other threads would all run at the non-AVX frequency, as expected). I'm using the version of Linpack that comes with MKL; I find it at /opt/intel/compilers_and_libraries_2018.3.222/linux/mkl/benchmarks/mp_linpack/, for example.

That problem was avoided by using older or newer versions of MPI.

The current problem is more subtle, and is from, I suspect, Intel MPI using "light-weight" AVX-512 instructions on Skylake; perhaps for copying data. The wikichip page https://en.wikichip.org/wiki/intel/frequency_behavior says that "light-weight" AVX-512 instructions will run at the AVX 2.0 frequency.

Typically it wouldn't be a problem for MPI communication to run at a lower frequency, as the core's frequency will drop from the non-AVX frequency to the AVX 2.0 frequency while executing these instructions, but will return to the non-AVX frequency soon after.

The problem, however, is in jitter-sensitive environments. Even there it typically won't be a problem, because the MPI code will likely be in a section that is not sensitive to jitter; other cores should not be affected and the thread doing communication should soon return to its higher frequency..

However, Hewlett Packard Enterprise has a feature called Jitter Smoothing as part of its Intelligent System Tuning feature set (https://support.hpe.com/hpsc/doc/public/display?docId=a00018313en_us), When this feature is set to "Auto" mode, the server will notice changes to the operating frequency of the cores, and will decrease the frequency of all cores on all processors accordingly. The system soon gets to a stable state in which frequency changes no longer occur, with a resulting state of minimized jitter -- but at the lower frequency of the AVX 2.0 turbo frequency.

On my servers, I have this Jitter Smoothing feature enabled. When I run, for example, the MPI version of Linpack with the environment variable MKL_ENABLE_INSTRUCTIONS=SSE4_2 on Skylake with turbo enabled, I see the operating frequency gradually drop from the non-AVX frequency to the AVX2 frequency. This happens on all cores across the server, which is what Jitter Smoothing is expected to do. I do not see this happen when I run the OpenMP (non-MPI) version of Linpack; in that case all the CPUs stay at the non-AVX frequency for the duration of the workload.

As I said, I hypothesize that MPI is using some AVX-512 instructions, perhaps to copy data, and that's causing my performance problem. If my hypothesis is correct, my question is if there is a way to tell Intel MPI to not use AVX instructions, similar to the MKL_ENABLE_INSTRUCTIONS environment variable that Intel MKL uses.

Newman__Chuck · ‎06-25-2019

I heard from a contact at Intel that this issue has been addressed in Intel MPI Version 2019. The environment variable I_MPI_SHM can be used to specify what instruction set should be used for the shared memory transport code, with the default being AVX2 for recent and current processors. Because these are lightweight AVX2 instructions, they will run at non-AVX frequencies and therefore should not reduce the operating frequency. Additionally, Intel MPI Version 2019 Update 3 has tuned the transport code for Cascade Lake for AVX2.

I've not yet had a chance to verify this.

This is documented in the MPI Version 2019 Update 3 document titled "Intel MPI Library Developer Reference for Linux OS" under Environment Variable Reference -> Environment Variables for Fabrics Control -> Shared Memory Control. That section can be found at https://software.intel.com/en-us/mpi-developer-reference-linux-shared-memory-control