Throughput MUL/FMA Broadwell

Timocafé · ‎02-12-2019

Hello,

I suspect that recent Intel architectures execute the (packed double) `MUL` instruction like an `FMA` with a zero addend (on the Broadwell architecture, and probably later ones).

In detail, I am currently computing a product of quadratic polynomials (Pi), following this pattern:

P1*P2*P3*P4

Every polynomial Pi(x) = a + bX + cX^2 is evaluated with two successive `FMA`s. However, when I measure the throughput of my problem, the numbers are very low. According to Agner Fog's instruction tables [Agner Fog][1], page 242, the throughput of both `FMA` and `MUL` is 0.5, where throughput is defined as the time, in cycles, before a new identical instruction can be started.

So I would expect a penalty when going from the `FMA` to the `MUL`, but my measurements are smooth and show none. I suspect that, under the hood, the processor replaces the `MUL` with an `FMA` with a zero addend (although my assembly contains both FMA and MUL), or at least uses the same part of the FPU circuit for both, which would explain my results.

I may be completely wrong, so I would appreciate it if a hardware engineer could confirm or refute this.

All the best

++t

[1]: https://www.agner.org/optimize/instruction_tables.pdf

McCalpinJohn · ‎02-13-2019

Agner's tables show the throughput for sequences of independent instructions, and the latency for sequences of dependent instructions. With a strongly out-of-order processor, it can be difficult to tell how much independent work the hardware can find across loop iterations, making cycle count estimation challenging.

In your example, a + b*x + c*x^2, a common rearrangement is a + x*(b+c*x). This can be computed with 2 FMAs, but the second FMA is dependent on the output of the first. With constant a, b, c, and a contiguous vector x, it is possible for the hardware to fully pipeline these operations, but in the presence of additional instructions, performance may be limited by factors other than the FMA throughput -- e.g., loads, stores, total instructions, etc. The compiler models this, and when there is another performance limiter, it does not always generate enough temporaries to fully hide the FMA latencies.
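To make the dependency in the rearranged form concrete, here is a minimal C sketch. The function names are made up for illustration. Each multiply-add line is written as a plain expression, which a compiler targeting FMA hardware can emit as one FMA instruction; the C99 `fma()` function from `<math.h>` could be used instead to request a fused operation explicitly.

```c
/* Direct form, as written in the question: a + b*x + c*x^2. */
static double poly_direct(double a, double b, double c, double x)
{
    return a + b * x + c * x * x;
}

/* Rearranged (Horner) form: a + x*(b + c*x).
   Two multiply-adds, the second depending on the first. */
static double poly_horner(double a, double b, double c, double x)
{
    double inner = c * x + b;   /* first multiply-add: b + c*x */
    return x * inner + a;       /* second multiply-add: needs 'inner' */
}
```

For a = 1, b = 2, c = 3 and x = 0.5, both functions return 2.75. In general the two forms can differ in the last bits because the rounding steps are not the same.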

The compiler's job is made somewhat more difficult by the 3-operand FMA format used by Intel. Back in the olden days, SSE arithmetic instructions only had two arguments, with one of the input arguments overwritten by the output. If both input arguments were constants, the compiler needed to keep the original (constant) values in separate registers and generate extra register-to-register copies so that the SSE instruction would overwrite a copy of the constant, rather than the constant itself. AVX provides a three-operand instruction format that eliminates this for dyadic operations like add and multiply, but that is not enough for FMA -- it must also overwrite one of its inputs. In the example above, the first FMA is (b+c*x), where b and c are constants and x is needed again by the second FMA, so none of the three inputs can be overwritten. So again the compiler must keep the original values in registers and copy one of them into a temporary register to be overwritten by the FMA. This adds to the overall instruction count and (as seen below) reduces throughput for L1-contained data.

Intel's processors show a fairly non-intuitive history of floating-point operation latencies and throughputs. From Agner's tables, the latency and reciprocal throughput for packed double Add/Mul/FMA instructions have evolved like this:

Processor    Add      Mul      FMA      Notes
Haswell      3/1      5/0.5    5/0.5    only Port 1 can do Add; Ports 0-1 can do Multiply or FMA
Broadwell    3/1      3/0.5    5/0.5    reduced Multiply latency
Skylake      4/0.5    4/0.5    4/0.5    symmetric Ports 0-1; uniform latency for Add/Mul/FMA

As a simple example, I compiled a code that performs this operation on a 1024-element vector of input elements, producing a 1024-element vector of output elements (allowing everything to fit in L1 cache):

        for (i=0; i<1024; i++) y[i] = a + b*x[i] + c*x[i]*x[i];

I compiled for both CORE-AVX2 and CORE-AVX512 and ran on a Xeon Platinum 8160 (frequency pinned to 2.1 GHz to make measurement easier).

For the CORE-AVX2 target, the compiler unrolled the loop to process 16 indices per iteration, using 4 accumulators (each holding 4 doubles). The unrolled inner loop contains 4 loads (of x[]), 4 register-to-register copies (to allow overwriting), 8 FMAs (operating on 4 accumulators), 4 stores (of y[]), and 3 loop control instructions (add, compare, branch), for a total of 23 instructions. The compare and branch should be fused, giving 22 uops per iteration. Execution time was about 5.9 cycles per loop iteration, or 3.95 instructions per cycle -- very close to the expected 4 instruction per cycle issue limit, even though 4 accumulators is only 1/2 of what is needed to fully tolerate the FMA latency of 4 cycles.

For the CORE-AVX512 target, the generated code also processed 16 indices per iteration. Due to the wider vectors, the instruction counts were 2 loads, 2 register-to-register copies, 4 FMAs (using 2 accumulators), 2 stores, and 4 loop control instructions (2 adds, compare, branch), for a total of 14 instructions. Execution time was about 3.57 cycles per iteration, or 4.3 instructions per cycle. (This is 4.0 uops per cycle after correcting for the fusion of the compare and branch.) Again, the compiler used only 2 accumulators, while 8 would be necessary to hide the 4-cycle latency of 2 FMA pipes, but performance is limited by instruction issue and the FMA stalls are effectively overlapped with other instructions.
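The two back-of-the-envelope limits used in this post can be written down as a small C sketch. The helper names are hypothetical, and the model is deliberately crude: it only multiplies latency by the number of pipes and divides uops by the issue width.

```c
/* Independent dependency chains (accumulators) needed to keep every
   pipe busy: latency in cycles times the number of pipes. */
static int chains_needed(int latency_cycles, int num_pipes)
{
    return latency_cycles * num_pipes;
}

/* Lower bound on cycles per loop iteration imposed by the issue width
   alone, ignoring every other limiter. */
static double issue_bound_cycles(int uops_per_iter, int issue_width)
{
    return (double)uops_per_iter / issue_width;
}
```

With a 4-cycle FMA latency and 2 FMA pipes, `chains_needed(4, 2)` gives the 8 accumulators mentioned above. For the CORE-AVX2 loop, `issue_bound_cycles(22, 4)` gives 5.5 cycles per iteration. For the CORE-AVX512 loop, assuming the compare and branch fuse so that 14 instructions become 13 uops, `issue_bound_cycles(13, 4)` gives 3.25 cycles.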

Timocafé · ‎02-19-2019

An answer from Dr. Bandwidth <3, what an honor.

What is not clear to me is the following situation. Consider an FMA and a MUL that are independent of each other, and an Intel processor with a single FPU (to simplify).

The two instructions are different, but they are executed on the same hardware FMA unit. So will the FPU perform the FMA and then the MUL, for a total latency of 8 [cycles]? Or will the processor pipeline the MUL, because the two instructions are executed on the same FMA unit (I mean the same transistors in the hardware), for a latency of 4.5 [cycles]?

McCalpinJohn · ‎02-19-2019

For operations that are all independent, latency does not really mean anything -- it is only dependencies that can make the latency visible for timing tests.
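As an illustration of what "dependent" and "independent" mean for a timing test, here is a minimal C sketch of two kernels that each perform n multiply-adds. The function names and the choice of 8 chains are assumptions made for the example. No timing code is included, and what a compiler actually generates for these loops should be checked in the assembly.

```c
#include <stddef.h>

#define NCHAINS 8   /* illustrative choice, not a measured optimum */

/* Dependent chain: every multiply-add needs the previous result,
   so the operation latency is exposed on each step. */
static double madd_dependent(double m, double a, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc = acc * m + a;
    return acc;
}

/* Independent chains: NCHAINS accumulators with no dependency between
   them, so consecutive multiply-adds can overlap in the pipeline.
   n is assumed to be a multiple of NCHAINS. */
static double madd_independent(double m, double a, size_t n)
{
    double acc[NCHAINS] = {0.0};
    for (size_t i = 0; i < n; i += NCHAINS)
        for (int k = 0; k < NCHAINS; k++)
            acc[k] = acc[k] * m + a;

    double sum = 0.0;
    for (int k = 0; k < NCHAINS; k++)
        sum += acc[k];
    return sum;
}
```

The two kernels do the same number of multiply-adds but do not compute the same value in general. With m = 1 and a = 1 they both return n, which is convenient as a sanity check.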

For Intel processors, all of the basic FP operations are fully pipelined, so each functional unit can accept one instruction per cycle. In your example, the FMA and Mul would be issued on two consecutive cycles, and each would complete execution after the corresponding operation latency. On processors where the FMA and Mul latency is the same (e.g., Haswell and Skylake, from my table above), the operations would complete execution in the same order that they were issued. For Skylake, both latencies are 4 cycles, giving a schedule like:

Cycle    Operation
  0      FMA issues
  1      Mul issues
  2
  3
  4      FMA completes
  5      Mul completes
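The schedule above follows from a very simple model, sketched here in C. The function name is hypothetical, and the model is an assumption for illustration rather than a description of the real scheduler: one fully pipelined unit accepts one independent operation per cycle, so operation k issues on cycle k and completes `latency[k]` cycles later.

```c
#include <stddef.h>

/* Fill done[k] with the completion cycle of operation k, assuming
   operations issue back to back, one per cycle, starting at cycle 0. */
static void completion_cycles(const int *latency, size_t n, int *done)
{
    for (size_t k = 0; k < n; k++)
        done[k] = (int)k + latency[k];
}
```

With the Skylake latencies {4, 4} for an FMA followed by a Mul, this gives completion at cycles 4 and 5, matching the schedule. With the Broadwell latencies from the table earlier in the thread, {5, 3}, the same model has the Mul completing at cycle 4, one cycle before the FMA at cycle 5.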

I agree that it is not completely obvious that instructions that pipeline with identical instructions will also pipeline with other (pipelined) instructions that execute in the same unit. For Intel's mainstream processors, the pipelining is (almost?) always of this fully general type.

For instructions that are not fully pipelined (e.g., FP divide), there is typically *some* overlap possible between the execution of the divide instruction and the execution of independent instructions in the same functional unit, but this is fairly tricky to measure. An older (but very well written) paper on this topic is "Floating Point Division and Square Root Algorithms and Implementation in the AMD K7 Microprocessor", by Oberman, et al. (https://doi.org/10.1109/ARITH.1999.762835).