Intel® ISA Extensions

Why is FMA slower than SSE here?

Daniel_F_
Beginner

I am optimizing an app which computes correlation coefficients many times. The loops were easy to vectorize, but there are also some calculations made outside of them. I tried to partially optimize those using SSE and then, as an obvious next step, to use FMA. It turned out that the SSE version has roughly the same performance as the original one, and to my surprise the FMA version was about 3% slower. I checked the generated code and it looks as expected. GCC also tried to optimize my code this way on its own, so I had to add the target("no-fma") attribute to prevent it.

Why is the FMA version slower?

The code used for testing is below. Platform: Xeon E5-2680, Red Hat 6 x86_64, gcc 4.8.2. Compiled using the following command: g++ -o test test.cc -O3 -march=core2 -mtune=core2 -mfma

#include "immintrin.h"
#include "stdio.h"
#include "time.h"
#include "math.h"

__attribute__((noinline, target("no-fma")))
double test(double a, double b, double c)
{
    return (a - b * c) / sqrt((1 - b * b) * (1 - c * c));
}

static const __m128d v_1 = { 1.0, 1.0 };

__attribute__((noinline, target("no-fma")))
double test2(double a, double b, double c)
{
    __m128d v_bc = { b, c };

    __m128d v1 = _mm_sub_pd(v_1, _mm_mul_pd(v_bc, v_bc));
    __m128d v2 = _mm_permute_pd(v1, 1);
    v1 = _mm_mul_pd(v1, v2);

    return (a - b * c) / sqrt(v1[0]);
}

__attribute__((noinline))
double test3(double a, double b, double c)
{
    __m128d v_bc = { b, c };

    __m128d v1 = _mm_fnmadd_pd(v_bc, v_bc, v_1);
    __m128d v2 = _mm_permute_pd(v1, 1);
    v1 = _mm_mul_pd(v1, v2);

    return (a - b * c) / sqrt(v1[0]);
}

#define CNT 500000000ull

double a = 0.1, b = 0.2, c = 0.3;

int main()
{
    clock_t t;
    double d = 0.0;
    t = clock();
    for (size_t n = 0; n < CNT; ++n)
        d += test(a, b, c);
    printf("%0.3f\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (size_t n = 0; n < CNT; ++n)
        d += test2(a, b, c);
    printf("%0.3f\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (size_t n = 0; n < CNT; ++n)
        d += test3(a, b, c);
    printf("%0.3f\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    printf("%f\n", d);

    return 0;
}

 

McCalpinJohn
Honored Contributor III

Lots of issues here....

  1. The Xeon E5-2680 does not support FMA instructions.   Support for FMA starts in Xeon E5-xxxx v3.
  2. Without function inlining you are likely to see significant function call overhead.
  3. Without function inlining you will not be able to overlap the latencies of the arithmetic operations for consecutive elements.
  4. The "test2" and "test3" functions are doing twice as much work as the "test" function, since they are operating on two elements at a time, but are still called the same number of times.
  5. The latency for FMA is longer than the latency for ADD. Most of the time this can be ignored because consecutive operations can be pipelined, but you have prevented that with the "noinline" attribute.
  6. Effective vectorization of sum reductions requires multiple partial sums. 
    1. For Xeon E5 v3, with an FMA latency of 5 cycles and two FMA units, you need 10 partial sums to fully overlap the instruction latency.  (Most of the time you will run out of load bandwidth before you reach full speed, but you will need multiple partial sums to get reasonable performance in any case.) 
    2. The compiler will often generate multiple partial sums if the code is simple enough, but I often have to tweak it to generate as many as are needed.  This can be done either by manually introducing multiple partial sum variables or by using an "unroll" pragma on the loop. 
    3. Using an OpenMP "parallel for" with a "reduction" clause often results in the best code (even if running with only one thread) -- presumably because the compiler sees this as a weakening of the limitations on associative transformations. (A minimal sketch follows below.)
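
A minimal sketch of point 6.3, assuming gcc with -fopenmp and hypothetical arrays a, b and c of length n (not the actual application data): the reduction clause tells the compiler it may reassociate the sum, so it can keep several partial sums in flight and vectorize the loop.

#include <math.h>
#include <stddef.h>

// Sketch only: sum of the correlation expression over hypothetical arrays.
double corr_sum(const double *a, const double *b, const double *c, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; ++i)
        sum += (a[i] - b[i] * c[i]) / sqrt((1.0 - b[i] * b[i]) * (1.0 - c[i] * c[i]));
    return sum;
}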

 

jimdempseyatthecove
Honored Contributor III

In addition to John's comments, one typically measures the vectorized form of the AVX/FMA instructions. As coded you are using scalar operations, and because of the noinline issue John mentioned you also have the call overhead (though a, b and c may be passed via registers).

// a, b and c point to aligned allocations
// note: #pragma simd requires a compiler that supports it (e.g. the Intel compiler)

__attribute__((target("no-fma")))
double test(
  double* restrict a __attribute__((aligned (64))),
  double* restrict b __attribute__((aligned (64))),
  double* restrict c __attribute__((aligned (64))),
  int count)
{
  double ret = 0.0;
  #pragma simd reduction(+:ret)
  for(int i=0; i<count; ++i)
    ret += (a[i] - b[i] * c[i]) / sqrt((1.0 - b[i] * b[i]) * (1.0 - c[i] * c[i]));
  return ret;
}


__attribute__((target("fma")))
double test_fma(
  double* restrict a __attribute__((aligned (64))),
  double* restrict b __attribute__((aligned (64))),
  double* restrict c __attribute__((aligned (64))),
  int count)
{
  double ret = 0.0;
  #pragma simd reduction(+:ret)
  for(int i=0; i<count; ++i)
    ret += (a - b * c) / sqrt((1.0 - b * b) * (1.0 - c * c));
}

Jim Dempsey

Daniel_F_
Beginner

Thanks for the answer. Here are my comments:

Ad 1. You are right, I forgot to copy the "v3".

Ad 2, 3. I did this on purpose for testing, to prevent gcc from optimizing my functions too much. The final production code uses inlining; the extra cost of a function call should definitely be avoided in tight loops.

Ad 4. I do not get this one. All 3 functions take 3 scalar double values and return 1 scalar double. There are no vectors or pointers to arrays of data used as parameters. My intention was to use SSE/FMA inside the function, without changing its declaration. When I wrote my question my goal was to find out why function test3 is slower than test2. I provided function test only for reference and to explain what I had as a starting point.

Ad 5. The multiply operation has the same latency and throughput as an FMA operation on a vector of the same length. The addition depends on the result of the multiplication, so the CPU will not be able to execute both of them in parallel. The addition also has non-zero latency and throughput, so it needs some time of its own. The CPU will try to pipeline this, but two dependent SSE operations should still take more time than the corresponding FMA instruction. And in both cases the result is fed to the same sqrt instruction. Additionally, the expression (a - b * c) in the numerator will be compiled to one FMA instruction too, which should also improve things a bit. So with these assumptions the FMA version should be faster. Which of my assumptions is wrong?
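
(Rough back-of-envelope, assuming Haswell latencies of about 5 cycles for multiply and FMA and 3 cycles for add/subtract: the SSE path through 1 - b*b costs roughly 5 + 3 = 8 cycles on the dependency chain, while the single FNMADD costs about 5, so the FMA form should shorten the chain by roughly 3 cycles per call.)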

Ad 6. Thanks for these suggestions. Unfortunately my app mostly executes the expressions shown here, so I probably cannot do much about this. Especially now, after I split the calculations so that the reciprocal square roots (in the form 1/sqrt(1-x*x)) are computed first and the results are then reused multiple times in expressions like (a-b*c)*d*e. Everything is stored in arrays, so I simply calculate multiple expressions at once. The example functions provided earlier are my attempt to optimize the last step, when I am left with 3 double values and have to perform the final calculation on them.
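
For illustration, a minimal sketch of that split with hypothetical arrays (x for the precomputation; a, b, c plus precomputed factors d, e for the combine step), not the actual application code:

#include <math.h>
#include <stddef.h>

// Sketch only: precompute r[i] = 1/sqrt(1 - x[i]*x[i]) once...
void precompute_rsqrt(const double *x, double *r, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        r[i] = 1.0 / sqrt(1.0 - x[i] * x[i]);
}

// ...then reuse the precomputed factors many times in a vectorizable loop,
// e.g. expressions of the form (a - b*c) * d * e with d and e taken from r.
void combine(const double *a, const double *b, const double *c,
             const double *d, const double *e, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (a[i] - b[i] * c[i]) * d[i] * e[i];
}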

McCalpinJohn
Honored Contributor III

Sorry about my comment #4 -- I did not look carefully enough at what your code was doing....

This sort of vectorization of what is effectively scalar code is not typically very helpful -- the overhead of packing/unpacking and permuting generally exceeds the benefit from a reduction in the number of arithmetic operations.
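
As a point of comparison (a sketch only, assuming an FMA-capable target and the same signature as test3): the last step can stay scalar and still use FMA, with no packing or permute, via two scalar FNMADDs. Whether that actually wins has to be measured.

#include <immintrin.h>
#include <math.h>

// Sketch only: same computation as test3, kept scalar (no pack/permute).
__attribute__((noinline))
double test4(double a, double b, double c)
{
    __m128d one = _mm_set_sd(1.0);
    __m128d vb  = _mm_set_sd(b);
    __m128d vc  = _mm_set_sd(c);
    double t1 = _mm_cvtsd_f64(_mm_fnmadd_sd(vb, vb, one));  // 1 - b*b
    double t2 = _mm_cvtsd_f64(_mm_fnmadd_sd(vc, vc, one));  // 1 - c*c
    return (a - b * c) / sqrt(t1 * t2);
}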

To understand the results I would do three things to get started:

  • First, draw a data dependence diagram and compute an estimate of the critical path latency for the "test()" version of the function.  
  • Second, compare that estimated critical path with the measured performance in cycles (a minimal RDTSC-based timing sketch follows below).
    • It helps to pin the processor frequency during these tests so that you know how to convert seconds to cycles.
  • Third, compare the compiler-generated assembly code for the three versions of the computation, and see if it is consistent with the expected critical path and the observed cycle count.
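
A minimal sketch of the measurement step, assuming the TSC frequency is close to the pinned core frequency and that test() is the function from the original post:

#include <x86intrin.h>
#include <stdio.h>

double test(double a, double b, double c);  // the OP's function, linked in separately

int main()
{
    const unsigned long long iters = 100000000ull;
    double d = 0.0;
    unsigned long long t0 = __rdtsc();
    for (unsigned long long n = 0; n < iters; ++n)
        d += test(0.1, 0.2, 0.3);
    unsigned long long t1 = __rdtsc();
    // Reference (TSC) cycles per call; convert to core cycles via the pinned frequency.
    printf("%.2f cycles per call (%f)\n", (double)(t1 - t0) / (double)iters, d);
    return 0;
}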