Hi,
I was surprised to notice the following behavior of the Intel compiler (17.0.2 20170213 on Linux) with -xCORE-AVX2. The following code generates FMA instructions:
double norm(double* x, int n) { double ans = 0.0; for (int i = 0; i < n; ++i) { ans += x[i] * x[i]; } return ans; }
but the following code does not:
float norm(float* x, int n) { float ans = 0.0f; for (int i = 0; i < n; ++i) { ans += x[i] * x[i]; } return ans; }
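For reference, the generated instructions can be checked with a compile line along these lines (the file name is just for illustration):

icc -xCORE-AVX2 -O2 -S norm.c
grep vfmadd norm.s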
Is there a reason for this, or is it a missed optimization from the compiler?
Best regards,
Francois
When icc does use FMA for a float dot product, it riffles (unrolls into multiple independent partial sums) by a factor large enough to cover the extra latency of FMA when the operands are resident in L1. icc will choose not to use FMA if its cost evaluation indicates FMA may be slower. This decision may be influenced by the assumed trip count, which you can adjust with a pragma. Your choice of a 32- or 64-bit target may also influence the choice.
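For what it's worth, a minimal sketch of that trip-count hint, assuming icc's loop_count pragma (the count values are placeholders, not tuned):

float norm(float* x, int n) {
    float ans = 0.0f;
    /* Hint to the vectorizer's cost model that the loop usually runs many
       iterations, so FMA latency can be assumed to be amortized. */
    #pragma loop_count min(64), avg(1024), max(1000000)
    for (int i = 0; i < n; ++i) {
        ans += x[i] * x[i];
    }
    return ans;
}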
As a matter of interest (at least to me), gcc needs the -mno-fma -ffast-math options to show its best AVX2 performance here, as it does no riffling. The FMA version may run about 60% longer, in line with the documented latency.
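To illustrate what the riffling buys, here is a hand-written sketch of the float reduction with four independent partial sums, so successive FMAs are not all chained through one accumulator; the factor of 4 is an assumption for illustration, not what icc actually picks:

float norm_riffled(const float* x, int n) {
    /* Four independent accumulators break the single dependency chain,
       so FMA latency can be overlapped across iterations. */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i]     * x[i];
        s1 += x[i + 1] * x[i + 1];
        s2 += x[i + 2] * x[i + 2];
        s3 += x[i + 3] * x[i + 3];
    }
    float ans = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)   /* remainder */
        ans += x[i] * x[i];
    return ans;
}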
