angus-hewlett
Beginner

Early indicators of AVX512 performance on Skylake?

Hi all,

Looking ahead, what can we expect from the first generation of AVX512 on the desktop - or when should we expect an announcement?

In the past:

- The first generations of SSE CPUs didn't have a full-width engine; they broke 128-bit SSE operations into two 64-bit uops

- The first AVX CPUs (Sandy Bridge / Ivy Bridge) needed two cycles for an AVX store; the L1 cache didn't have the bandwidth to perform a 256-bit store in one cycle

So what I'd like to know is:

- Will the AVX512 desktop CPUs be able to handle a full-width L1 load and store per cycle?

- Will they retain Broadwell's (fantastic) dual-issue, 3-cycle latency VFMUL/VFMADD unit but widened to 512 bits?

Thanks much for any light you can shed,

 Angus.

4 Replies
bronxzv
New Contributor II

angus-hewlett wrote:
Will they retain Broadwell's (fantastic) dual-issue, 3-cycle latency VFMUL/VFMADD unit but widened to 512 bits?

note that although VMULPx latency is reduced to 3 clocks on Broadwell, VFMADDx latency is unchanged at 5 clocks

for Skylake (Xeon only?) we know that peak flops per core will be doubled, per https://software.intel.com/en-us/blogs/2013/avx-512-instructions:

The evolution to Intel AVX-512 contributes to our goal to grow peak FLOP/sec by 8X over 4 generations: 2X with AVX1.0 with the Sandy Bridge architecture over the prior SSE4.2, extended by Ivy Bridge architecture with 16-bit float and random number support, 2X with AVX2.0 and its fused multiply-add (FMA) in the Haswell architecture and then 2X more with Intel AVX-512.

so we can assume dual 512-bit FMA units per core in the 1st products; that said, it tells us nothing about effective flops, i.e. we don't know if the lackluster (for 256-bit AVX) load/store and cache bandwidth of Sandy Bridge will repeat

angus-hewlett
Beginner

thanks bronxzv, that's useful to know - not least because the "add" in FMA can no longer be considered entirely free (as it effectively was in Haswell). In some circumstances it may be better not to fuse. Are the compilers up to figuring this stuff out yet? AFAIK fma is typically invoked from intrinsics.. fun.

It would be a big departure to include AVX512 only on part of the lineup - I'd have thought it's the biggest new feature, and it will represent a substantial chunk of the core. That 32-entry register file is going to be nice to have as well - with 5 cycles of latency on an FMA, if I write tight loops they end up slowed down by data dependencies, but a looser loop (interleaving a few streams of basically identical instructions) ends up saturating the load/store units. With 32 registers I can have three interleaved streams with ten registers per stream, and far fewer spills and refills.
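The interleaving idea can be sketched in plain C (a hypothetical illustration: scalar accumulators stand in for what would be 512-bit registers, and four chains stand in for the ten-plus needed to cover a 5-cycle dual-issue FMA):

```c
#include <stddef.h>

/* Dot product with four independent accumulator "streams".  Each
 * stream carries its own dependency chain, so consecutive FMAs need
 * not wait on each other; with a 5-cycle FMA latency and 2 FMA ports
 * you want ~10 chains in flight, which is where 32 register names
 * become valuable.
 */
double dot_interleaved(const double *a, const double *b, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {       /* four independent chains  */
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);  /* combine the partial sums */
    for (; i < n; ++i)                 /* scalar remainder         */
        s += a[i] * b[i];
    return s;
}
```

Note the reassociation of the sum changes rounding slightly relative to a single-accumulator loop, which is why compilers only do this automatically under relaxed FP models.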

bronxzv
New Contributor II

angus-hewlett wrote:

thanks bronxzv, that's useful to know - not least because the "add" in FMA can no longer be considered entirely free (as it effectively was in Haswell). In some circumstances it may be better not to fuse. Are the compilers up to figuring this stuff out yet? AFAIK fma is typically invoked from intrinsics.. fun.

FMA is generated well by the compilers from high-level C++ code (vectorized and scalar alike); case in point, the Intel compiler was already handling FMA roughly 1 year ahead of the Haswell launch. Moreover, legacy code based on vmulpx/vaddpx intrinsics generates FMA code when targeting AVX2, so all in all I don't see the point of using the dedicated FMA intrinsics

I'm not sure whether there are already optimizations that avoid fusing based on critical-path latency for Broadwell targets, though

angus-hewlett wrote:
It would be a big departure to include AVX512 only on part of the lineup

it looks rather odd indeed - probably on-die but disabled, if that's really how things pan out, as in the past with hyperthreading, for example: it was enabled only on the 180 nm Xeon (Foster MP), not the 180 nm desktop part (Willamette), even though the uarchitecture was (supposedly) the same

McCalpinJohn
Black Belt

Given the fairly clear statement that AVX-512 will have 2x the peak performance per cycle of AVX2 and the importance of being able to demonstrate that doubling on the LINPACK benchmark, I believe that AVX-512 systems will have to support two 512-bit (64 Byte) loads per cycle from the L1 Data Cache. 

Looking historically (and counting 1 64-bit multiply plus 1 64-bit add as an FMA):

Processor       FMAs/Cycle    Loads/Cycle (doubles)    Ratio
Nehalem              2                  2                 1
Sandy Bridge         4                  4                 1
Xeon Phi             8                  8                 1
Haswell              8                  8                 1

Looking at the optimization options for a register-blocked DGEMM kernel (which constitutes the bulk of the work in the LINPACK benchmark), it is clear that it is possible to reduce the requirements for loads to less than one load per FMA by careful unrolling and register blocking, but this is not without its own problems, and it appears that these would prevent an implementation from reaching (nominal) full performance with a sustained L1 cache bandwidth of 1/2 load per FMA (which is the next logical design point).

Problem 1:  For SIMD machines one of the array blocks being loaded needs to be transposed before use.   There are two basic approaches to performing the transposition: (1) load the data into SIMD registers and then transpose it using register-to-register permute instructions,  or (2) load the elements individually with broadcast across the locations of the SIMD register.   In both cases the overhead (in instructions and in registers to be written) grows linearly with the width of the SIMD registers.  In the first case you typically run into issue limitations on the port that can perform permute operations, while in the second case you run out of load bandwidth from the cache (and are obviously negating at least a portion of the reduction in the number of loads that you were trying to obtain by register blocking).   Hybrids are possible (though ugly), and several Intel presentations describe such hybrid optimizations -- some of the transposition operations are performed using register-to-register permute functions and some of the transposition operations are performed by reloading the data with broadcast. 

Problem 2: Register blocking becomes quickly limited by the available register names.  E.g., for a Haswell core with a 5-cycle FMA dependent-operation latency and 2 FMA units, you need a minimum of 10 named registers for accumulators.  Depending on exactly how the unrolling was done, you also need 4-5 named registers to hold the values that you want to re-use from registers (rather than reload from cache), and it is not possible for all of these uses to be close together in the instruction stream.  This leaves 1-2 AVX register names for the streaming accesses, which is nominally enough, but is also clearly cutting it very close.

I have sketched out the optimizations required to implement the register-blocked DGEMM kernel on AVX-512 with two 512-bit FMA units, and the transpositions required make it challenging to get (nominally) full performance even with 2 512-bit loads per cycle.   Increasing the number of registers from 16 to 32 (as AVX-512 provides) appears mandatory.  This allows doubling the unrolling, which allows you to double the re-use of the transposed values, which is just what you need to compensate for the transposes taking twice as many operations (due to the doubled SIMD width).
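The load-reduction mechanism of register blocking can be shown with a scalar 2x2 micro-kernel (an illustrative sketch, not the actual AVX-512 kernel sketched above; in the SIMD version each accumulator is a zmm register and the B loads become broadcasts):

```c
#include <stddef.h>

/* Register-blocked GEMM micro-kernel: a 2x2 block of C is held in
 * "accumulator" variables across the whole K loop, and every loaded
 * element of A and B feeds two FMAs -- halving loads per FMA relative
 * to the naive triple loop.  Row-major storage; n assumed even.
 */
void dgemm_blocked_2x2(size_t n, const double *A, const double *B,
                       double *C) {
    for (size_t i = 0; i < n; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0; /* register block */
            for (size_t k = 0; k < n; ++k) {
                double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;  /* each load reused twice */
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j]           = c00;
            C[i * n + j + 1]       = c01;
            C[(i + 1) * n + j]     = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
    }
}
```

Per K iteration this does 4 loads for 4 multiply-adds (1 load/FMA); widening to a 4x4 block gives 8 loads for 16 multiply-adds (0.5 load/FMA), which is precisely why the block size, and hence the register count, governs the achievable load-to-FMA ratio.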

These notes only consider the L1 load bandwidth issues.  DGEMM is typically blocked for the L2 cache as well as unrolled for register re-use.  The bandwidth reduction for accesses beyond the level of the memory hierarchy holding the blocks is proportional to the square root of the block size.  For double-precision values, the current 256KiB L2 caches allow for block sizes of up to about 100x100 (3 blocks * 8 Bytes/element * 100*100 = 240,000 Bytes = 91% of the L2 cache), which provides a bandwidth reduction of 50x.   This is marginal on current Haswell systems -- on our Xeon E5-2690 v3 (12 core, 2.6 GHz) we can get good performance using all 12 cores, but when using 24 cores we get poor scaling if all of the data is located in the memory of one of the two sockets.  Interleaving the data across the memory of both sockets (using "numactl --interleave=0,1") provides enough memory bandwidth (and QPI bandwidth) to regain good scaling.

If peak FLOP rates continue to increase faster than sustained bandwidths, another layer of blocking will be required to achieve full performance even on a single socket.  This extra layer of blocking could be in the L3 or in a hypothetical future L4 cache, but whatever level is used is going to have to keep growing quadratically, since the bandwidth reduction is proportional to the square root of the block size.