McCalpinJohn · ‎01-09-2019

In December 2017, a colleague asked for some help in porting a synthetic "Peak GFLOPS" code to our Xeon Phi 7250 systems. This code just executes register-to-register VFMADD instructions in an unrolled loop. While trying to understand various idiosyncrasies of the performance characteristics, it became clear that the Xeon Phi x200 cannot execute VPU instructions at more than 6/7 of the nominal peak performance -- i.e., 12 VPU instructions should take 6 cycles, but we observe that 7 cycles are required.

This effect was observed in hundreds of test cases, using VPU instructions of any width, any latency, or any ISA, but the effect does *not* apply to ALU or Memory instructions.

My colleague Damon McDougall (now at AMD) made a short presentation on this topic at the IXPUG Fall Conference in September 2018 (https://www.ixpug.org/components/com_solutionlibrary/assets/documents/1538587841-IXPUG_Fall_Conf_2018_paper_16%20%282%29%20-%20Damon%20McDougall.pdf)

A longer write-up, including description of some previously undocumented performance counter masks, is available at https://sites.utexas.edu/jdm4372/2018/01/22/a-peculiar-throughput-limitation-on-intels-xeon-phi-x200-knights-landing/

Note that this does not appear to impact the performance of any "real" codes! Even DGEMM is not impacted because about 20% of the instructions in DGEMM are not VPU instructions, so the 2-instruction-per-cycle limit reduces the peak VPU rate to 1.6 instructions per cycle, which is below the limit explored here.
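The arithmetic behind that comparison can be checked in a few lines. This is only a sketch: the 20% non-VPU fraction is the approximate figure quoted above, and the function and constant names are made up for illustration.

```python
from fractions import Fraction

ISSUE_WIDTH = 2                      # instructions per cycle (the 2-instruction-per-cycle limit)
OBSERVED_VPU_CAP = Fraction(12, 7)   # 12 VPU instructions per 7 cycles, as observed


def vpu_rate(non_vpu_fraction):
    """VPU instructions per cycle when the issue slots are shared with non-VPU work."""
    return ISSUE_WIDTH * (1 - non_vpu_fraction)


# Assumption: roughly 20% of the DGEMM instruction stream is non-VPU.
dgemm_rate = vpu_rate(Fraction(1, 5))

print("DGEMM VPU rate:", float(dgemm_rate))
print("Observed VPU cap:", float(OBSERVED_VPU_CAP))
print("DGEMM stays below the cap:", dgemm_rate < OBSERVED_VPU_CAP)
```

Any code whose non-VPU fraction pushes the VPU rate below 12/7 per cycle would, by the same reasoning, never reach the limit.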

jimdempseyatthecove · ‎01-10-2019

John,

from the second link above, the code used was:

..B1.8:
addl $1, %eax
vfmadd213pd %zmm16, %zmm17, %zmm29
vfmadd213pd %zmm16, %zmm17, %zmm28
vfmadd213pd %zmm16, %zmm17, %zmm27
vfmadd213pd %zmm16, %zmm17, %zmm26
vfmadd213pd %zmm16, %zmm17, %zmm25
vfmadd213pd %zmm16, %zmm17, %zmm24
vfmadd213pd %zmm16, %zmm17, %zmm23
vfmadd213pd %zmm16, %zmm17, %zmm22
vfmadd213pd %zmm16, %zmm17, %zmm21
vfmadd213pd %zmm16, %zmm17, %zmm20
vfmadd213pd %zmm16, %zmm17, %zmm19
vfmadd213pd %zmm16, %zmm17, %zmm18
cmpl $1000000000, %eax
jb ..B1.8

What about using

..B1.8:
addl $1, %eax
vfmadd213pd %zmm16, %zmm17, %zmm29
vfmadd213pd %zmm15, %zmm17, %zmm28
vfmadd213pd %zmm16, %zmm17, %zmm27
vfmadd213pd %zmm15, %zmm17, %zmm26
vfmadd213pd %zmm16, %zmm17, %zmm25
vfmadd213pd %zmm15, %zmm17, %zmm24
vfmadd213pd %zmm16, %zmm17, %zmm23
vfmadd213pd %zmm15, %zmm17, %zmm22
vfmadd213pd %zmm16, %zmm17, %zmm21
vfmadd213pd %zmm15, %zmm17, %zmm20
vfmadd213pd %zmm16, %zmm17, %zmm19
vfmadd213pd %zmm15, %zmm17, %zmm18
cmpl $1000000000, %eax
jb ..B1.8

IOW alternating destination registers

Note, the above does not include the summation of zmm16 and zmm15 which can occur outside of the loop.

Jim Dempsey

McCalpinJohn · ‎01-10-2019

This assembly format is the other way around -- the output registers are zmm18-zmm29. You need at least 12 accumulators to tolerate a 6-cycle dependent-operation latency with two vector pipes.

We did experiments with fewer accumulators and showed exactly the expected behavior -- e.g., with 10 accumulators you get 10 FMAs every 6 cycles (limited by the dependent-operation latency). With more than 12 accumulators the behavior was exactly the same as with 12 accumulators -- 6/7 of peak.
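A rough model of those observations, assuming throughput is simply the smallest of the latency bound, the two-pipe bound, and the empirical 6/7 cap. The names are hypothetical, and the model is only known to match the two cases reported here (10 accumulators, and 12 or more); it is not a description of the hardware mechanism.

```python
from fractions import Fraction

FMA_LATENCY = 6                 # cycles of dependent-operation latency
VECTOR_PIPES = 2                # FMAs that can start per cycle
OBSERVED_CAP = Fraction(12, 7)  # empirical: 12 FMAs per 7 cycles (6/7 of peak)


def fmas_per_cycle(accumulators):
    """Predicted FMA throughput for a loop with independent accumulator chains."""
    # Each accumulator can start one new FMA every FMA_LATENCY cycles.
    latency_bound = Fraction(accumulators, FMA_LATENCY)
    return min(latency_bound, VECTOR_PIPES, OBSERVED_CAP)


for n in (10, 12, 16):
    print(n, "accumulators ->", fmas_per_cycle(n), "FMAs/cycle")
```

With 10 accumulators the latency bound (10 FMAs per 6 cycles) is the tighter one; from 12 upward the empirical cap takes over.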

We tried a variety of cases with different (non-overwritten) input registers, just in case there was a limit on the rate that a single register could be read, but observed no changes -- still 6/7 of peak.

The KNL register architecture (as described in the IEEE Micro paper) is unusual -- the architectural registers are not a subset of the rename registers, but are a separate structure that is updated at instruction retirement. This requires an extra read port on the rename registers, which seems expensive to me. If I am counting correctly, at 2 FMAs per cycle the FP rename registers have to support 8 512-bit reads and 2 512-bit writes. (Each FMA requires reading 3 input registers and writing 1 output register, and the retirement unit has to read the output register for each FMA in order to copy it to the architectural registers.) That is a lot of ports!
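The port count can be written out explicitly. This is just the bookkeeping from the paragraph above, with hypothetical names; it assumes every FMA reads all three inputs from the rename registers and that retirement costs one additional read per result.

```python
FMAS_PER_CYCLE = 2
SOURCE_READS_PER_FMA = 3      # three input registers per FMA
RETIREMENT_READS_PER_FMA = 1  # result read again to copy into the architectural registers
WRITES_PER_FMA = 1            # one output register per FMA


def rename_file_ports(fmas_per_cycle):
    """Return (reads, writes) per cycle needed on the FP rename registers."""
    reads = fmas_per_cycle * (SOURCE_READS_PER_FMA + RETIREMENT_READS_PER_FMA)
    writes = fmas_per_cycle * WRITES_PER_FMA
    return reads, writes


reads, writes = rename_file_ports(FMAS_PER_CYCLE)
print(f"{reads} x 512-bit reads, {writes} x 512-bit writes per cycle")
```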

jimdempseyatthecove · ‎01-10-2019

Sorry about the assembler muck-up (AT&T assembler format versus Microsoft assembler format as to whether the destination is the rightmost or leftmost register).

I hope that the Intel CPU design engineers can make use of this test to locate and tweak the performance a bit.

In looking at your utexas.edu link and then going to the home page of your blog, I see you have an interesting presentation on performance variability, which appears to be caused by Snoop Filter Conflicts. This is quite interesting and may shed additional light on an observation (and hypothesis) about the behavior of a program illustrated on http://www.lotsofcores.com/ (Plesiochronous Phasing Barriers in Action). In the video, bottom right quadrant, it can be observed that (after the first complete pass populating the cache) the thread stall time is not consistent (indicating that some threads take longer than others). In the presentation I assumed that this was attributable to an interaction with interrupt service causing cache line evictions. After reading your paper on Snoop Filter Conflicts, it may be the case that some of the thread completion skew is due to Snoop Filter Conflicts.

FWIW, the technique used in that article was devised to circumvent the problem you observed (page 29) with static load distribution with synchronization: when one thread is delayed in reaching the synchronization point by a (Snoop Filter Conflict induced) cache eviction, all of the threads being synchronized suffer the delay. The Plesiochronous Phasing Barrier mitigates this by reducing synchronization to the hardware threads (HT) within a core, as opposed to all cores/threads in the team. IOW it permits the cores to skew (when possible).

Jim Dempsey

jimdempseyatthecove · ‎01-10-2019

Additional hypothesis for cause of Snoop Filter Conflicts

The cause may not necessarily be the juxtaposition of cache lines between cores, given that the event is observed only ~1% of the time and is potentially restricted to specific cores within the CPU. The cause may instead be a conflict induced by the (near) simultaneous access of particular cache line combinations. IOW should the progress of the core differ from that in the conflicting observation, then the conflict would not appear. A second possibility to explore would be whether the conflict occurs when conditions are right (wrong) for Core-N -> Core-M but not for Core-M -> Core-N (i.e. due to the path taken in the conflict resolution).

Jim Dempsey

McCalpinJohn · ‎01-10-2019

The SKX snoop filter is 12-way set-associative, so to get into a "thrashing" scenario you have to be accessing at least 13 lines repeatedly in a relatively short time period. With 24 cores/48 threads, it is possible to get into this state with each core or thread accessing only a single cache line, but all my tests are based on threads accessing contiguous (virtual) blocks.
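The "at least 13 lines" threshold can be illustrated with a toy model of a single 12-way set. Note the assumptions: the replacement policy is taken to be LRU, which is not something stated in this thread, and all lines are assumed to map to the same set. The function name is hypothetical.

```python
from collections import OrderedDict


def miss_rate(ways, distinct_lines, passes=100):
    """Cyclically touch `distinct_lines` lines that all map to one set.

    Assumption: LRU replacement. Real snoop filters may use another policy.
    """
    lru = OrderedDict()          # keys are line ids, ordered oldest to newest
    misses = accesses = 0
    for _ in range(passes):
        for line in range(distinct_lines):
            accesses += 1
            if line in lru:
                lru.move_to_end(line)        # hit: mark as most recently used
            else:
                misses += 1
                if len(lru) >= ways:
                    lru.popitem(last=False)  # evict the least recently used line
                lru[line] = True
    return misses / accesses


print("12 lines in a 12-way set:", miss_rate(12, 12))
print("13 lines in a 12-way set:", miss_rate(12, 13))
```

Under these assumptions, 12 lines miss only on the first pass, while a thirteenth line evicts the entry that is needed next, so every access misses.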

KNL also shows snoop filter conflicts, but I don't think that Intel has documented the associativity on that processor. The problem is somewhat less obvious on KNL because the shared L2 cache makes the L2 cache miss rate more variable. The interaction of relative thread timing and the dynamic behavior of the L2 hardware prefetcher makes L2 cache miss rates pretty close to unpredictable.....

A peculiar throughput limitation for VPU instructions in Xeon Phi x200 (KNL)