Preventing FP overcounts for AVX instructions on Sandy Bridge

McCalpinJohn · 08-03-2015

As most readers of this forum are aware, the performance counter events for floating-point operations can overcount significantly on Sandy Bridge/Ivy Bridge platforms. This applies to both Event 0x10 "FP_COMP_OPS_EXE.*" and Event 0x11 "SIMD_FP_256.*". The overcounts are clearly related to stalls -- the counts appear to be very close to the expected values for data in the L1 Data Cache, increase slightly for data in the L2 cache, increase significantly for data in the L3 cache, and are very high for data in memory. The degree of overcounting depends on the details of the generated code and the load on the memory subsystem, but many common cases result in overcounting by 3x to 10x.

Last month I accidentally found a way to prevent these overcounts when using AVX instructions on the Sandy Bridge processors!

Using the STREAM benchmark as my test code, I was surprised to find that the MPI version showed no overcounting when compiled for AVX. It took some sleuthing, but it turns out that the compiler generates slightly different code to load the data for this case.

The basic OpenMP version of the code compiled for AVX generates the following code for the STREAM Triad kernel:

                                # LOE rax rdx rcx rbx rsi rdi r8 r12 r13 r14 r15d ymm0
..B1.107:                       # Preds ..B1.107 ..B1.106
        lea       (%rbx,%rax,8), %r9                            #380.43
        addq      $4, %rax                                      #380.5
        vmulpd    (%r8,%r9), %ymm0, %ymm1                       #380.55
        vaddpd    (%rdi,%r9), %ymm1, %ymm2                      #380.55
        vmovntpd  %ymm2, (%rdx,%r9)                             #380.36
        cmpq      %rcx, %rax                                    #380.5
        jb        ..B1.107      # Prob 50%                      #380.5

The MPI version of the code is very similar except that it loads the 256-bit data values into registers in two steps -- first loading the lower 128 bits of the 256-bit input into the lower 128 bits of a temporary register, then merging that with the next 128 bits from memory using the VINSERTF128 instruction. To verify that this was actually the cause of the change, I took the code above and manually edited the assembly file to perform this 2-step load on both inputs to the Triad kernel:

                                # LOE rax rdx rcx rbx rsi rdi r8 r12 r13 r14 r15d ymm0
..B1.107:                       # Preds ..B1.107 ..B1.106
        lea       (%rbx,%rax,8), %r9                            #380.43
        addq      $4, %rax                                      #380.5
        vmovupd   (%r8,%r9),%xmm15                                              # MCCALPIN - patch002
        vinsertf128 $1,16(%r8,%r9),%ymm15,%ymm14                                # MCCALPIN - patch002
        vmulpd    %ymm14, %ymm0, %ymm1                          #380.55         # MCCALPIN - patch002
#        vmulpd    (%r8,%r9), %ymm0, %ymm1                      #380.55
        vmovupd   (%rdi,%r9),%xmm15                                             # MCCALPIN - patch002
        vinsertf128 $1,16(%rdi,%r9),%ymm15,%ymm14                               # MCCALPIN - patch002
        vaddpd    %ymm14, %ymm1, %ymm2                          #380.55         # MCCALPIN - patch002
#        vaddpd    (%rdi,%r9), %ymm1, %ymm2                     #380.55
        vmovntpd  %ymm2, (%rdx,%r9)                             #380.36
        cmpq      %rcx, %rax                                    #380.5
        jb        ..B1.107      # Prob 50%                      #380.5
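
The same two-step load can also be tried from C with AVX intrinsics instead of patching compiler output. This is only a minimal sketch, not the code used in the test above. The names load_split_pd, triad_avx_split and triad_split_load are made up for illustration. It assumes GCC or Clang (for the target attribute and __builtin_cpu_supports), and it uses an ordinary unaligned store in place of the non-temporal VMOVNTPD. A compiler is free to merge the two 128-bit loads back into a single 256-bit load, so check the generated assembly for the VINSERTF128 before trusting any counter results.

```c
#include <assert.h>
#include <stddef.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define HAVE_AVX_PATH 1
#include <immintrin.h>

/* Hypothetical helper: build a 256-bit value from two 128-bit loads,
   mirroring the VMOVUPD (xmm) + VINSERTF128 pair in the patched listing. */
__attribute__((target("avx")))
static inline __m256d load_split_pd(const double *p)
{
    __m128d lo = _mm_loadu_pd(p);      /* elements 0 and 1 */
    __m128d hi = _mm_loadu_pd(p + 2);  /* elements 2 and 3 */
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
}

/* STREAM Triad, a[i] = b[i] + scalar * c[i], four doubles per iteration. */
__attribute__((target("avx")))
static void triad_avx_split(double *a, const double *b, const double *c,
                            double scalar, size_t n)
{
    __m256d s = _mm256_set1_pd(scalar);
    for (size_t i = 0; i < n; i += 4) {
        __m256d prod = _mm256_mul_pd(s, load_split_pd(c + i));
        __m256d sum  = _mm256_add_pd(prod, load_split_pd(b + i));
        _mm256_storeu_pd(a + i, sum);  /* plain store, not VMOVNTPD */
    }
}
#endif

/* Entry point: uses the AVX path when the CPU supports it and falls
   back to a plain scalar loop otherwise. n must be a multiple of 4. */
void triad_split_load(double *a, const double *b, const double *c,
                      double scalar, size_t n)
{
    assert(n % 4 == 0);
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        triad_avx_split(a, b, c, scalar, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}
```

Calling triad_split_load(a, b, c, 3.0, n) on three arrays of n doubles gives the same results as the scalar Triad loop. Whether the counters behave as in the patched assembly depends on the instructions the compiler actually emits.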

I then compiled the modified assembly language file and ran it in both single-threaded and multi-threaded cases (8 threads on one 8-core chip). Performance counter Event 0x11, Umask 0x02 SIMD_FP_256.PACKED_DOUBLE showed:

Expected Value: 20 million

Original Code: 1 thread: 83 million (more than 4x overcount)

Original Code: 8 threads: 122 million (more than 6x overcount)

Modified Code: 1 thread: 20.06 million (0.3% overcount)

Modified Code: 8 threads: 20.32 million (1.6% overcount)
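
The overcount figures in parentheses follow directly from the measured and expected counts. As a small worked check (the function names are illustrative only, not part of any tool):

```c
#include <assert.h>

/* Ratio of measured to expected counts; 1.0 means no overcount. */
double overcount_ratio(double measured, double expected)
{
    assert(expected > 0.0);
    return measured / expected;
}

/* Overcount expressed as a percentage above the expected count. */
double overcount_percent(double measured, double expected)
{
    return 100.0 * (overcount_ratio(measured, expected) - 1.0);
}
```

With the counts above (in millions), overcount_ratio(83.0, 20.0) is 4.15 and overcount_ratio(122.0, 20.0) is 6.1, while overcount_percent(20.06, 20.0) is about 0.3 and overcount_percent(20.32, 20.0) is about 1.6.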

The modified code also ran about 3.5% faster than the original on a single thread -- similar to the improvement seen when using SSE on Sandy Bridge EP -- and at about the same speed as the original (0.6% faster) when using all 8 cores on a single chip.

I built another version of the code that loaded the data with a single instruction, but permuted it during the load (VPERMILPD). This also eliminated the floating-point overcounts, but without the (relatively small) performance improvement seen with the 128-bit loads. (The VPERMILPD version would change the answers with most codes, but since all the elements of each array are the same in the STREAM benchmark it does not matter if the elements are permuted.) The combination of these two results suggests that any instruction that executes on Port 5 breaks the link between the AVX arithmetic instructions and the preceding load and eliminates the retries that occur while the core is waiting for the data to arrive.

I spent a while looking for a similar "trick" for SSE, but was unsuccessful. Simply loading the data into a register, then using the register as the input to the SSE arithmetic instruction made no change to the overcount ratio (about 3.1x using a single thread). Loading the data 64 bits at a time using MOVLPD and MOVHPD reduced the overcounting slightly (to about 2.5x), but this is clearly not the same kind of "fix" that I see with AVX.

If anyone out there is an expert with binary instrumentation, it would be interesting to hear whether a tool like "Pin" might be able to rewrite a binary to change 256-bit loads into the two-part version with VINSERTF128 that eliminates AVX floating-point overcounts....

TimP · 08-03-2015

This looks as if your MPI code were built without 32-byte alignment assertions being in effect.

McCalpinJohn · 08-03-2015

The difference in code generation is clearly due to the differences in the way that the arrays were allocated and in the way that the compiler responded to those differences. The side effect on the performance counter results was completely unexpected!

It would be interesting to have a compiler option to force this behavior, since it makes the AVX floating-point operation counts accurate, but that would have made more sense when Sandy Bridge was new. At this point a binary re-writer might make sense, if the existing tools are flexible enough to make these changes.

McCalpinJohn · 08-04-2015

Further testing shows that if I load data using the VPERMILPD instruction (into an explicitly named temporary register), the floating-point instructions that use this data do not overcount. This is the case even if the VPERMILPD instruction uses the "identity" permutation -- loading contiguous data into contiguous positions in the register. For scalar data I use the 128-bit version of VPERMILPD to load the desired 64 bits into the upper and lower halves of a 128-bit XMM register, then use only the lower 64 bits of the register in subsequent instructions. This gave the correct answers, prevented overcounting of scalar FLOPS (Event 0x10 rather than Event 0x11), and did not impact the performance on the test code (STREAM).
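
The VPERMILPD loads described above can be sketched in C with inline assembly, which keeps the compiler from folding an identity permute away. This is an illustration under stated assumptions rather than the code used in the tests: the function names are hypothetical, it assumes GCC or Clang on x86-64 with AT&T-syntax inline assembly, and it falls back to plain copies when AVX is not available. The immediate 0xA is the identity selector for the 256-bit form, and 0x0 copies the low element into both halves of the 128-bit form. Note that the 128-bit memory form reads 16 bytes, so the 8 bytes following the scalar must be readable.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#define HAVE_AVX_ASM 1
#include <immintrin.h>

/* 256-bit load through VPERMILPD with the identity selector 0xA
   (imm8 bits 0..3 = 0,1,0,1, so every element keeps its position). */
__attribute__((target("avx")))
static void load4_vpermilpd(const double *src, double *dst)
{
    __m256d v;
    __asm__("vpermilpd $0xA, %1, %0"
            : "=x"(v)
            : "m"(*(const double (*)[4])src));
    _mm256_storeu_pd(dst, v);
}

/* Scalar load through the 128-bit VPERMILPD: selector 0x0 places
   src[0] in both halves; only the low half is used afterwards.
   The instruction reads 16 bytes starting at src. */
__attribute__((target("avx")))
static double load1_vpermilpd(const double *src)
{
    __m128d v;
    __asm__("vpermilpd $0x0, %1, %0"
            : "=x"(v)
            : "m"(*(const double (*)[2])src));
    return _mm_cvtsd_f64(v);
}
#endif

/* Copy four doubles, using the VPERMILPD load when AVX is available. */
void load4_permuted(const double *src, double *dst)
{
    assert(src != NULL && dst != NULL);
#ifdef HAVE_AVX_ASM
    if (__builtin_cpu_supports("avx")) {
        load4_vpermilpd(src, dst);
        return;
    }
#endif
    for (size_t i = 0; i < 4; i++)
        dst[i] = src[i];
}

/* Return src[0]; src[1] must also be readable on the AVX path. */
double load1_permuted(const double *src)
{
    assert(src != NULL);
#ifdef HAVE_AVX_ASM
    if (__builtin_cpu_supports("avx"))
        return load1_vpermilpd(src);
#endif
    return src[0];
}
```

Both helpers return the data unchanged. Any effect on the counters would come from the arithmetic instructions consuming a register produced by VPERMILPD instead of taking a memory operand directly.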

It would be interesting to know if there are any binary re-writing tools that could replace *all* loads to XMM or YMM registers with VPERM* instructions into a temporary register. This requires an extra register name, but only one, since all loads can use the same temporary register name (the hardware register renamer will handle any false dependencies that this appears to generate).