AVX512 slower than AVX2 with Intel MKL dgemm on Intel Gold 5118

victor-lee · 07-25-2019

We have been evaluating Intel Parallel Studio 19, Intel MKL and Intel Gold 5118 processors.
The Intel Gold 5118 processor supports AVX512.

During our investigation, we noticed that when using AVX512, the wall clock solution time of
our product is no better than, and sometimes worse than, the wall clock solution time
when using AVX2.

We investigated in detail and we have created a small test case that replicates the issue.
This test case is based on John D. McCalpin's program simple-MKL-DGEMM-test, which we obtained
from github.

Please see file dgemm-test01.tgz. This tarfile includes the source code, make script and results obtained
on our Linux computer. The compilation and linking options are in the file make.sh (run with sh make.sh).
The compilation is done using Intel Parallel Studio 19, version 19.0.4.243, and the corresponding
Intel MKL libraries are statically linked into the executable.

The file output.out gives the results from running the test program using script runtest.sh:

sh runtest.sh >& output.out

output.out shows the Linux version, output from /proc/cpuinfo and results.

The test program was run with only one core.

The test program was run with 4 options:

MKL_ENABLE_INSTRUCTIONS=SSE4_2
MKL_ENABLE_INSTRUCTIONS=AVX
MKL_ENABLE_INSTRUCTIONS=AVX2
MKL_ENABLE_INSTRUCTIONS=AVX512

perf stat was used to obtain detailed statistics for each option and the results are given in file output.out.

The results are summarized in this table:

ISA       wall clock time   instructions   instructions/cycle   CPU frequency (GHz)
-----------------------------------------------------------------------------------
SSE4_2         61.6            6.2E12             3.16                  3.2
AVX            31.8            3.0E12             3.03                  3.1
AVX2           16.8            1.8E12             3.47                  3.1
AVX512         17.2            7.6E11             1.52                  2.9

It is clear that the number of instructions / cycle is much worse for AVX512, and this causes the slowdown
compared to AVX2.

The CPU frequency is nearly the same for all of the ISAs (only one core is used, so the CPU is in turbo mode),
so the effect of slower CPU frequency for AVX512 workloads is not so important here.
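
As a rough consistency check on the table, run time should scale as instructions / (instructions/cycle × frequency). The sketch below uses only the table values and compares the predicted run time relative to AVX2 with the measured wall clock ratio. Only ratios are compared, because the units of the table columns are not stated; the function and variable names are illustrative.

```python
# Values copied from the summary table:
# (wall clock time, instructions, instructions/cycle, CPU frequency in GHz)
table = {
    "SSE4_2": (61.6, 6.2e12, 3.16, 3.2),
    "AVX":    (31.8, 3.0e12, 3.03, 3.1),
    "AVX2":   (16.8, 1.8e12, 3.47, 3.1),
    "AVX512": (17.2, 7.6e11, 1.52, 2.9),
}


def relative_time(isa, baseline="AVX2"):
    """Return (predicted, measured) run time of `isa` relative to `baseline`."""
    def model(row):
        _, instructions, ipc, ghz = row
        # time is proportional to cycles / frequency = instructions / (ipc * frequency)
        return instructions / (ipc * ghz)

    predicted = model(table[isa]) / model(table[baseline])
    measured = table[isa][0] / table[baseline][0]
    return predicted, measured


for isa in table:
    predicted, measured = relative_time(isa)
    print(f"{isa:7s} predicted {predicted:.2f}x  measured {measured:.2f}x")
```

On these numbers, AVX512 executes fewer than half the instructions of AVX2, but the lower instructions/cycle and the slightly lower frequency cancel that gain.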

My questions are

1) What is the cause of the low instructions/cycle for AVX512?

2) Is there anything that we can do to increase the instructions/cycle for AVX512 on our computer?

3) Is this decrease in instructions/cycle for AVX512 common to all Skylake processors, or are there Skylake processors that do
not show it?

Thanks in advance,

Victor

McCalpinJohn · 07-26-2019

This is the expected result on the Xeon Gold 5000-series processors (except the Xeon Gold 5122/5222).

If you look on the Intel product page for your processor (https://ark.intel.com/content/www/us/en/ark/products/120473/intel-xeon-gold-5118-processor-16-5m-cache-2-30-ghz.html), there is an entry near the bottom of the page for "# of AVX-512 FMA Units", where it reports that the processor has 1.

A single AVX-512 unit has the same peak performance per cycle as the two 256-bit AVX2 units in the processor, but typically runs at a lower frequency.

The Xeon Bronze 3000, Xeon Silver 4000, and Xeon Gold 5000 processors have one AVX-512 unit per core (except the Xeon Gold 5122/5222, which have two).

The Xeon Gold 6000 processors and Xeon Platinum 8000 processors all have two AVX-512 units per core.
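
The peak-rate argument above can be written out as a small calculation. This is a minimal sketch that counts one fused multiply-add as two floating-point operations per 64-bit lane; the function name is illustrative and not part of any Intel tool.

```python
def peak_dp_flops_per_cycle(vector_bits, fma_units):
    """Peak double-precision FLOPs per cycle per core for FMA-only code."""
    lanes = vector_bits // 64   # 64-bit doubles per vector register
    flops_per_fma = 2           # one multiply and one add per lane
    return lanes * flops_per_fma * fma_units


print(peak_dp_flops_per_cycle(256, 2))  # two 256-bit AVX2 FMA units
print(peak_dp_flops_per_cycle(512, 1))  # one AVX-512 FMA unit
print(peak_dp_flops_per_cycle(512, 2))  # two AVX-512 FMA units
```

The first two cases both give 16 FLOPs per cycle, so with a single AVX-512 FMA unit any frequency reduction makes AVX-512 the slower choice for FMA-bound code; only the two-unit parts reach 32.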

victor-lee · 07-26-2019

Hello John,

Thank you for posting. Can you confirm that my understanding is correct? I apologize for my limited understanding of this issue.

On a single core of an Intel 5118 Gold processor, a stream of AVX2 instructions is passed to the two AVX2 units, so that the AVX2 instructions are processed in parallel. This parallel effect increases the number of instructions per cycle. But a stream of AVX512 instructions cannot be processed in parallel because there is only one AVX512 unit on the core, so the number of instructions per cycle is not increased.

So there is no advantage in using AVX512 on this processor, at least not for dgemm calculations. And AVX512 becomes even worse relative to AVX2 when using SMP because the cpu frequency slowdown is more pronounced with AVX512 workloads.

Thanks in advance,

Victor

McCalpinJohn · 07-29-2019

Your summary is correct.

On the processors with one AVX-512 unit, the AVX-512 instruction set might provide some performance advantages for codes that can exploit its special features (masking, gather/scatter instructions, etc), but I can't point to any specific examples.

victor-lee · 08-05-2019

Hello John,

I looked at the Intel product page for the 5118 processor and I can see the entry "# of AVX-512 FMA Units = 1". But I don't see anywhere on that product page the number of 256-bit AVX2 units. How can I determine the number of AVX2 units for a given processor? Do all Gold processors have 2 AVX2 units?

Thanks in advance,

Victor

McCalpinJohn · 08-05-2019

As far as I can tell, Intel Haswell, Broadwell, Skylake (client), Skylake (server), and the client and server Skylake follow-on processors all have two 256-bit AVX2+FMA units.

One place where the distinction is mentioned is in Chapter 2 of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-041, April 2019). In section 2.1 "The Skylake Server Microarchitecture", the text notes:

The green stars in Figure 2-1 represent new features in Skylake Server microarchitecture compared to Skylake microarchitecture for client; a 1 MiB L2 cache and an additional Intel AVX-512 FMA unit on port 5 which is available on some parts. (emphasis added)

There is also a footnote above the new feature list containing a typical disclaimer, "Some features may not be available on all products."

In addition to the "green stars" in Figure 2-1, there is a red box labelled "AVX-512 Port Fusion", that includes all the vector functions of Port 0 and Port 1. Port 0 and Port 1 are the locations of the 256-bit vector FMA units for Haswell/Broadwell, Skylake (client), Skylake (server), and newer processors. These two units are logically combined to create the single AVX-512 unit on the "low end" Xeon Scalable processors. While it is possible to implement AVX-256 instructions using 128-bit FMA units (as in AMD's first-generation EPYC processors), I don't know of any Intel processors that implement the AVX2 instruction set without also including two full 256-bit pipelines.

victor-lee · 08-23-2019

Hello John,

Thank you for your answer. My understanding is that, for a CPU with only one AVX-512 unit/core, it is better not to use AVX512 instructions, but to use AVX2 instructions instead. However, for a CPU with two AVX-512 units/core, it is better to use AVX512 instructions. (This observation is based on workflows dominated by DGEMM calls.)

So I would like to auto-detect at run time whether the CPU has one or two AVX-512 units/core. I can use the _may_i_use_cpu_feature intrinsic to detect if the CPU supports AVX512 instructions. But I don't know how to detect the number of AVX-512 units/core.

I suppose that I could time DGEMM calls with and without enabling AVX512 instructions, and if the calls are faster with AVX512, then I can assume that there are two AVX-512 units/core. But this approach seems inelegant.
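
The timing approach could look roughly like the sketch below. It assumes a hypothetical callable `time_dgemm(isa)` that runs a representative DGEMM with MKL_ENABLE_INSTRUCTIONS set to `isa` and returns the elapsed seconds; because that setting probably has to be in place before MKL is first used, such a callable would likely launch a separate process per measurement. The function name and the `margin` parameter are illustrative.

```python
def pick_isa(time_dgemm, candidates=("AVX2", "AVX512"), margin=0.05):
    """Pick the MKL_ENABLE_INSTRUCTIONS value with the best measured time.

    time_dgemm: hypothetical callable, isa -> elapsed seconds.
    A later (wider) candidate replaces the current choice only if it is
    faster by more than `margin`, so measurement noise does not flip the result.
    """
    best = candidates[0]
    best_time = time_dgemm(best)
    for isa in candidates[1:]:
        elapsed = time_dgemm(isa)
        if elapsed < best_time * (1.0 - margin):
            best, best_time = isa, elapsed
    return best


# Stand-in for real measurements: the Gold 5118 timings from the table above.
measured = {"AVX2": 16.8, "AVX512": 17.2}
print(pick_isa(measured.get))
```

With the Gold 5118 timings this selects AVX2. The choice says nothing directly about the number of AVX-512 units; it only records which setting was faster for the tested problem.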

Thanks in advance,

Victor

McCalpinJohn · 08-24-2019

I don't know of any feature that will tell you the number of AVX512 units. If there were such an interface, I would have expected Intel to use it in MKL and in their optimized LINPACK benchmark to switch to 256-bit vectors on the Xeon Silver and Gold processors that have only one AVX512 unit.

Even if a processor has two AVX512 units, it is often faster to use 256-bit SIMD instructions because of their advantage in maximum Turbo frequency. Intel compilers will typically generate 256-bit SIMD instructions for the CORE-AVX512 target. The documentation suggests that the compiler attempts to model performance with different vector widths and chooses the best. On the Xeon Platinum 8160, for example, the maximum all-core (24-core) Turbo frequency for "high-current" AVX512 instructions is 2.0 GHz, while the corresponding value for "high-current" 256-bit instructions is 2.5 GHz. The 512-bit SIMD instructions would have to reduce estimated execution time (in cycles) by more than 20% relative to the 256-bit SIMD version to make AVX-512 worthwhile.
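
The 20% figure follows from time = cycles / frequency. A minimal sketch, using the Xeon Platinum 8160 all-core frequencies quoted above as defaults (the function name is illustrative):

```python
def avx512_speedup(cycle_ratio, freq_512_ghz=2.0, freq_256_ghz=2.5):
    """Speedup of 512-bit code over 256-bit code for the same work.

    cycle_ratio = cycles(512-bit) / cycles(256-bit).
    time = cycles / frequency, so the time ratio is the cycle ratio
    scaled by the frequency penalty.
    """
    time_ratio = cycle_ratio * (freq_256_ghz / freq_512_ghz)
    return 1.0 / time_ratio


# Break-even: 512-bit code must need fewer than 2.0 / 2.5 = 0.8 of the cycles.
for cycle_ratio in (0.5, 0.8, 0.9):
    print(cycle_ratio, round(avx512_speedup(cycle_ratio), 3))
```

A cycle ratio of 0.5 (ideal halving) gives a 1.6x speedup rather than 2x, 0.8 is break-even, and anything above 0.8 makes the 512-bit version slower.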

Jun_Y_Intel · 10-09-2019

Hi John, I doubt that the second AVX512 unit can run AVX2 instructions. Do you have data from your Xeon Platinum 8160 with MKL_ENABLE_INSTRUCTIONS=AVX2?

Thanks

Jason Ye

McCalpinJohn · 10-10-2019

The second AVX-512 unit does not execute AVX-256 instructions -- these are executed in the 256-bit units behind ports 0 and 1.
In AVX-512 mode, the 256-bit execution units behind ports 0 & 1 are "fused" into a 512-bit AVX-512 unit, and (on parts with a second AVX-512 unit) the second unit is accessed via port 5. This is all described in Figure 2-2 of the Intel Architectures Optimization Reference Manual (document 248966-042b, September 2019).

Jun_Y_Intel · 10-10-2019

Yes :). For GEMM- or LINPACK-like applications, we definitely need AVX512 code to get the best performance on processors with two AVX512 units, while using AVX2 on processors with one AVX512 unit may give slightly higher performance due to the higher base and turbo frequencies.

In your previous comments, you mentioned "Even if a processor has two AVX512 units, it is often faster to use 256-bit SIMD instructions because of their advantage in maximum Turbo frequency.", which implies that both AVX512 units can run fused AVX2 instructions.

McCalpinJohn · 10-11-2019

> [...] "Even if a processor has two AVX512 units, it is often faster to use 256-bit SIMD instructions because of their advantage in maximum Turbo frequency. " which implies both AVX512 units can run fused AVX2 instructions [...]

I don't think that implication can be derived from my statement? Sorry if I was not clear....

I have not tested the CORE_POWER.LVL*_TURBO_LICENSE values for AVX-512 instructions using 256-bit register operands (or smaller). The compiler generates AVX or AVX2 instructions for 256-bit SIMD unless specific AVX-512 functionality is required (e.g., masks). If these instructions correspond to a "Level 2 Turbo License", then the core will run at the same frequency as it does with AVX/AVX2 instructions, but if they require "Level 3 Turbo License" then performance would probably be better using 512-bit register operands (since the frequency would be the same).

I did notice in the STREAM benchmark that when the loops run for much longer than 1 millisecond the "Copy" kernel (which uses only load and store instructions) runs at a "Turbo License" level that is one step lower than the "Scale", "Add", and "Triad" kernels (all of which perform 64-bit arithmetic in addition to loads and stores), with correspondingly higher frequency. This contributes to an average frequency that does not make much sense -- e.g., about 20% of the time running at 2.5 GHz and 80% of the time running at 2.0 GHz (on the Xeon Platinum 8160, using all cores and COMMON-AVX512 code).
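
For reference, the time-weighted average implied by that split can be computed as below. This is a minimal sketch; the residency fractions are the approximate values quoted above and the function name is illustrative.

```python
def average_frequency(residency):
    """Time-weighted average frequency in GHz.

    residency: list of (fraction_of_time, frequency_ghz) pairs.
    """
    total = sum(fraction for fraction, _ in residency)
    return sum(fraction * ghz for fraction, ghz in residency) / total


# About 20% of the time at 2.5 GHz and 80% at 2.0 GHz.
print(average_frequency([(0.2, 2.5), (0.8, 2.0)]))
```

The result is about 2.1 GHz, a value that matches neither license level, which is why the reported average looks odd until the per-kernel behavior is separated out.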

McCalpinJohn · 10-28-2019

Unexpected follow-up on this issue....

I was surprised to see that it is possible to triple-issue some 256-bit SIMD instructions on both SKL and SKX cores. Now that I am looking at the numbers again, I vaguely remember that Chapter 2 of the Intel Optimization Reference Manual shows "Vec ALU" behind ports 0, 1, and 5 of both SKL and SKX, and that Agner Fog's Instruction Tables (https://www.agner.org/optimize/instruction_tables.pdf) show a "reciprocal throughput" of 0.33 (=> 3 instructions per cycle) for a small number of SIMD instructions on both processors....

Some of the instructions with a reported throughput of 3 instructions per cycle are register-to-register move instructions, which can be hard to benchmark because the register rename stage can sometimes eliminate the need to execute the move by sneaky renaming tricks. However, there are other instructions that perform work on the data, so unambiguous benchmarks are easier. Examples include VPBLEND[all data types], VPADD[BWDQ], VPAND/NAND/OR/XOR[all data types]. If I neglect the register-to-register move instructions, it looks like Agner's instruction tables show that both implementations can triple-issue the same subset of "vector ALU" instructions.

Travis_D_ · 11-11-2019

> [...] I was surprised to see that it is possible to triple-issue some 256-bit SIMD instructions on both SKL and SKX cores. [...]

Yes, since at least Ivy Bridge, there have been 3x 256-bit vector units which means the possibility exists to execute 3 vector instructions per clock (e.g., IVB can do it for bitwise operations like vpxor and friends). Probably there were also three units in Sandy Bridge, but perhaps no operations that could go to all units. On Haswell and later these three units are on p0, p1 and p5.

All mainstream (i.e., not talking about KNL and friends) AVX-512 capable machines have 2 AVX-512 units. There are no modern Intel CPUs with 1 AVX-512 unit. What you do have is CPUs with only one FMA unit, the one on p0. Some chips have a second FMA unit on p5 - but either way chips have AVX-512 capable units on both p0 and p5, it's just a question of whether FMA ops can go to p5 or not. The p5 unit can always do most integer ops and shuffles, regardless of the FMA status.

While AVX-512 instructions are in the scheduler, p1 is shut down to any SIMD ops, so even if you use 128-bit or 256-bit ops they can't go to p1: the machine acts as 2-wide for SIMD purposes.

McCalpinJohn · 11-12-2019

Sorry for being a bit sloppy --- Intel describes the Xeon Bronze and Xeon Gold 5000 series (except 5x22) as having "1 AVX-512 FMA unit".

From a little bit of testing on a Xeon Gold 5120 (with a single AVX-512 FMA unit), Port 5 still has the "non-floating-point" functions. I was able to sustain three operations per cycle for 256-bit VPXORD instructions, as expected.

The CORE_POWER.* performance counters give strange results on this system (compared to the 2-FMA models that I am used to), but I don't have time to investigate this further.....

huang__xinsheng · 12-21-2019

Hello,

I see a similar issue: AVX512, AVX2, and AVX give similar performance on the following CPU: "Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz". The issue persists even after updating MKL to 2019 Update 5, which was released recently (201908).

It is odd that AVX512 does not bring better performance for a matmul such as (1*2048, 2048*4096); it always costs about 2.5 ms with AVX512, AVX2, or AVX.

Please share if anybody has a clue about this issue. Thanks a lot.

MKL lib used in this case:
MKL_VERBOSE Intel(R) MKL 2019.0 Update 5 Product build 20190502 for Intel(R) 64 architecture Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled processors, Lnx 2.50GHz lp64 intel_thread

test result:
export OMP_NUM_THREADS=1; export MKL_ENABLE_INSTRUCTIONS=SSE4_2; ./a.out
MKL_VERBOSE SGEMM(N,N,4096,1,2048,0x7ffd7cfe2258,0x7f46e836c010,4096,0x942340,2048,0x7ffd7cfe2260,0x944350,4096) 4.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

export OMP_NUM_THREADS=1; export MKL_ENABLE_INSTRUCTIONS=AVX; ./a.out
MKL_VERBOSE SGEMM(N,N,4096,1,2048,0x7ffc9437de08,0x7fdd8e848010,4096,0x1dd1340,2048,0x7ffc9437de10,0x1dd3350,4096) 2.86ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

export OMP_NUM_THREADS=1; export MKL_ENABLE_INSTRUCTIONS=AVX2; ./a.out
MKL_VERBOSE SGEMM(N,N,4096,1,2048,0x7fff2880c328,0x7f82ad953010,4096,0x17a0340,2048,0x7fff2880c330,0x17a2350,4096) 2.52ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

export OMP_NUM_THREADS=1; export MKL_ENABLE_INSTRUCTIONS=AVX512; ./a.out
MKL_VERBOSE SGEMM(N,N,4096,1,2048,0x7fff08740a88,0x7f5e7fcd3010,4096,0x114b340,2048,0x7fff08740a90,0x114d350,4096) 2.47ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
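
The MKL_VERBOSE lines report m=4096, n=1, k=2048, so each call is effectively a matrix-vector product. The rough sketch below converts the reported times into GFLOP/s and into the rate at which the single-precision matrix A would have to be read. It counts only the bytes of A and ignores the two vectors; the function name is illustrative.

```python
def gemm_rates(m, n, k, seconds, bytes_per_element=4):
    """Return (GFLOP/s, GB/s for reading A once, FLOPs per byte of A)."""
    flops = 2 * m * n * k                 # one multiply and one add per term
    bytes_a = m * k * bytes_per_element   # A dominates the memory traffic when n = 1
    return flops / seconds / 1e9, bytes_a / seconds / 1e9, flops / bytes_a


# Times in milliseconds, copied from the MKL_VERBOSE output above.
timings_ms = {"SSE4_2": 4.33, "AVX": 2.86, "AVX2": 2.52, "AVX512": 2.47}
for isa, ms in timings_ms.items():
    gflops, gb_per_s, intensity = gemm_rates(4096, 1, 2048, ms * 1e-3)
    print(f"{isa:7s} {gflops:5.2f} GFLOP/s  {gb_per_s:5.2f} GB/s  {intensity:.2f} FLOP/byte")
```

At 0.5 FLOP per byte of A, one possible reading is that this call is limited by streaming A from memory rather than by the FMA units, which would make the vector width largely irrelevant. That interpretation is an assumption and is not confirmed in this thread.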