Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Issue with O2/O3 optimisation using Intel compiler 2017 update 2

Arthur_P_
Beginner
933 Views

Hello,

While trying to compare parallelism between OpenMP, MPI, CUDA and OpenACC, I've encountered some dangerous behavior using the "Intel ICPC 17.0.2 20170213". The code below solves the classical 2D temperature distribution problem (just as a test):

// /opt/intel/bin/icpc -O2 main.cpp && time ./a.out
#include <iostream>
#include <cmath>

int main()
{

  const int nx = 2800;
  const int ny = 2800;
  const float lx = 1.f;
  const float ly = 1.f;
  float dx = lx / (float)nx;
  float dy = ly / (float)ny;
  const float lambda = 0.001f;
  float dt = 0.01f/lambda/(1.0f/(dx*dx)/12.f+1.0f/(dy*dy)/12.f);
  float dtdx = dt*lambda/dx/dx/12.f;
  float dtdy = dt*lambda/dy/dy/12.f;

  int nxt = nx + 4;
  int nyt = ny + 4;

  float* T = new float[nxt*nyt];
  float* Told = new float[nxt*nyt];

  for (int i = 0; i < nxt; i++) {
    for (int j = 0; j < nyt; j++) {
      int ind = i*nyt + j;
      T[ind] = 1.0f + exp(-32.f*(pow((float)(j-nyt/2)/(float)nyt,2)+pow((float)(i-nxt/2)/(float)nxt,2)));
      Told[ind] = T[ind];
    }
  }

  for (int step=0; step<1000; step++) {

    for (int i = 2; i < nxt-2; i++) {
      for (int j = 2; j < nyt-2; j++) {
        int ind = i*nyt + j;
        T[ind] = Told[ind] + dtdx * (-Told[(i-2)*nyt + (j)]+16.f*Told[(i-1)*nyt + (j)]-30.f*Told[(i)*nyt + (j)]+16.f*Told[(i+1)*nyt + (j)]-Told[(i+2)*nyt + (j)])
        + dtdy * (-Told[(i)*nyt + (j-2)]+16.f*Told[(i)*nyt + (j-1)]-30.f*Told[(i)*nyt + (j)]+16.f*Told[(i)*nyt + (j+1)]-Told[(i)*nyt + (j+2)]);
      }
    }

    for (int i = 0; i < nxt*nyt; i++) {
        Told[i] = T[i];
    }

  }

  float sum = 0.0f;
  for (int i = 0; i < nxt*nyt; i++) {
      sum += T[i];
  }

  std::cout << sum/(float)(nxt*nyt) << std::endl;

  return 0;
}

After 1000 "time" iterations, the code is supposed to print 1.08712 (the value obtained with the -O0 or -O1 compile flag).

Without any parallelism (no OpenMP), the same code compiled with -O2 prints a different (wrong) result: 1.09783

/opt/intel/bin/icpc -O2 main.cpp && time ./a.out

Using LLVM C++ (Apple) or GNU G++, with the O2 or even the O3 optimization flag, gives the expected result (1.08712).

It seems that Intel compiler with O2 or O3 compile flag does aggressive optimizations on floating-point data. In order to get the good result, -fp-model precise flag needs to be added to the command line.

Why does the O2 flag enable such aggressive optimizations (while GNU and LLVM do not)? I mean, I know that O3 can be a dangerous flag to use, but I thought O2 was, at least, usable. If you are not aware of the -fp-model flag, your results may be completely wrong...

Thank you.

5 Replies
TimP
Honored Contributor III

You may need to declare sum as double to find out whether the batched sum may in fact be the more accurate one.
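
A minimal sketch of that suggestion, assuming the final reduction is pulled out into two helper functions. The names mean_float_acc and mean_double_acc are hypothetical and do not appear in the original program; only the accumulator type differs between them.

```cpp
#include <cstddef>

// Mean with a float accumulator, as in the original program.
// Once the running sum is large, each added element is rounded
// to the spacing of floats near the sum, so low-order bits are lost.
float mean_float_acc(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum / static_cast<float>(n);
}

// Same reduction with a double accumulator (hypothetical variant).
// The inputs are still float; only the running sum is wider.
double mean_double_acc(const float* data, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum / static_cast<double>(n);
}
```

Calling both on the final T array (for example mean_double_acc(T, nxt*nyt)) gives a reference value against which the -O1 and -O2 float results can be compared.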

jimdempseyatthecove
Honored Contributor III

Try: -no-fast-transcendentals

This then should use higher precision functions for exp and pow.
Note, -fp-model strict may (should) imply -no-fast-transcendentals, but I cannot attest to this.

Jim Dempsey

SergeyKostrov
Valued Contributor II
>>...In order to get the good result, -fp-model precise flag needs to be added to the command line.

Options -fp-model precise and, for example -fp-model fast=2, are not identical and results will be different.

>>...If you are not aware of the -fp-model flag, your results may be completely wrong...

A developer should know that results could be less accurate ( not completely wrong... ). Also, as Tim P mentioned saving intermediate results of calculations in variable(s) of double-precision type(s) always improves accuracy of computations.

>>...I've encountered some dangerous behavior using the "Intel ICPC 17.0.2 20170213"

It is expected behavior because different FPU modes ( when -fp-model options are used ) don't guarantee identical results of computations.
TimP
Honored Contributor III

As Jim implied, the fast-transcendentals option (short vector math functions) may reduce the precision of pow() by as much as 4 ULP, but that should show up only beyond the 14th decimal place. fast-transcendentals is implied by -fp-model fast. I would be surprised if icpc took advantage of C++ to replace pow() and exp() by powf() and expf(), which might show differences in the 5th decimal. I do find the various C++ standards and practices confusing on this point.
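
To make the overload question concrete, here is a small compile-time sketch. It assumes a C++11 or later standard library such as libstdc++ or libc++ and uses the std::-qualified functions, whereas the original program calls exp and pow unqualified. The two helper functions are hypothetical names introduced only for illustration.

```cpp
#include <cmath>
#include <type_traits>

// With <cmath> in C++11 and later (checked here at compile time):
// float arguments select the float overloads...
static_assert(std::is_same<decltype(std::exp(1.0f)), float>::value,
              "exp(float) returns float");
static_assert(std::is_same<decltype(std::pow(1.0f, 2.0f)), float>::value,
              "pow(float, float) returns float");
// ...but an integer exponent promotes the whole call to double.
static_assert(std::is_same<decltype(std::pow(1.0f, 2)), double>::value,
              "pow(float, int) returns double");

// Shape of the thread's initial condition: pow(float, 2) yields double,
// so the exponent and the exp() call are evaluated in double.
double gaussian_as_written(float x, float y) {
    return 1.0f + std::exp(-32.f * (std::pow(x, 2) + std::pow(y, 2)));
}

// Hypothetical all-float variant, for comparison only.
float gaussian_all_float(float x, float y) {
    return 1.0f + std::exp(-32.f * (x * x + y * y));
}
```

Under these assumptions the initialisation in the original program is computed in double and only rounded to float on assignment to T[ind], which is consistent with the expectation that the transcendentals are not the main source of the difference.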

The varying precision of float sum reduction with order of evaluation is entirely likely to explain the observed differences.  If you want the least accurate result, corresponding to strict application of the language rule against re-association, that clearly doesn't correspond to -fp-model fast.
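
The effect of evaluation order can be reproduced without any compiler flags. The sketch below compares a strict left-to-right float sum with a sum that keeps eight independent partial sums, which is roughly the re-association a vectorizer performs with eight float lanes. The function names and the lane count of 8 are assumptions chosen for illustration; the code actually generated by icpc may differ.

```cpp
#include <cstddef>

// Strict left-to-right accumulation in float, as the language rules require.
float sum_sequential(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

// Eight independent partial sums, combined at the end. Each partial sum
// stays about eight times smaller than the single running sum, so less
// of every added element is rounded away.
float sum_lanes8(const float* data, std::size_t n) {
    float lane[8] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        for (int k = 0; k < 8; ++k) {
            lane[k] += data[i + k];
        }
    }
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) {
        sum += lane[k];
    }
    // Remainder elements that did not fill a complete group of eight.
    for (; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

For an array of nxt*nyt values close to 1, the two functions can differ well before the 5th significant digit, and a double-precision reference shows which one is closer. Compile the comparison with a value-safe model (for example -fp-model precise with icpc) so that sum_sequential is not itself re-associated.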

You could check the differences due to vectorization one loop at a time with #pragma novector.  g++ will not vectorize any of this without at least -ffast-math, loosely corresponding to the icpc option -fp-model fast.
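
As a sketch of the one-loop-at-a-time check, the final reduction could be wrapped as below. The function name sum_no_vector is hypothetical, and the preprocessor guard is an assumption added so that other compilers do not warn about an unknown pragma.

```cpp
#include <cstddef>

// Float sum with vectorization disabled for this one loop when built
// with the Intel compiler; other compilers see a plain loop.
float sum_no_vector(const float* data, std::size_t n) {
    float sum = 0.0f;
    // Intel-specific pragma: applies to the loop that follows it.
#if defined(__INTEL_COMPILER)
#pragma novector
#endif
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

If the -O2 result returns to 1.08712 once only this loop is marked, the difference comes from the re-associated sum rather than from the stencil update.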

SergeyKostrov
Valued Contributor II
Here are notes I've recently made when investigating some accuracy issues.
...
// - If for CMMA -fp-model fast=2 used accuracy is affected. For example,
//   [ SP FP Processing - 8192 x 8192 - LPS=ijk - CS=ij:ik:jk ]
//   ...
//   Matrix C first four elements:
//   8192.099609 8192.049805 8192.099609 8192.049805 ( -fp-model fast=2 )
//   vs.
//   8192.003906 8192.003906 8192.003906 8192.003906 ( -fp-model precise )
//   vs.
//   accuracy also improves when an intermediate variable sum is
//   declared as DP FP type ( double ):
//
//   8192.082031 8192.082031 8192.082031 8192.082031
//
// Compiled:
//   icpc -O3 -xMIC-AVX512 -qopenmp -mkl -fp-model fast=2 -fma -unroll=4 test14.c -o test14.out
//
// Abbreviations:
//   MKL   - Math Kernel Library
//   CBLAS - C Basic Linear Algebra Subprograms
//   CMMA  - Classic Matrix Multiplication Algorithm
//   LPS   - Loop Processing Schema
//   CS    - Compute Schema
//   FP    - Floating Point
//   SP    - Single Precision
//   DP    - Double Precision
...