
- Intel Community
- Software Development Tools (Compilers, Debuggers, Profilers & Analyzers)
- Intel® C++ Compiler
- Issue with O2/O3 optimisation using Intel compiler 2017 update 2



Arthur_P_

Beginner


03-12-2017
04:01 PM

77 Views

Issue with O2/O3 optimisation using Intel compiler 2017 update 2

Hello,

While comparing parallel approaches (OpenMP, MPI, CUDA and OpenACC), I've encountered some dangerous behavior with Intel ICPC 17.0.2 (build 20170213). The code below solves the classical 2D temperature-distribution problem (just as a test):

```cpp
// /opt/intel/bin/icpc -O2 main.cpp && time ./a.out
#include <iostream>
#include <cmath>

int main()
{
    const int nx = 2800;
    const int ny = 2800;
    const float lx = 1.f;
    const float ly = 1.f;
    float dx = lx / (float)nx;
    float dy = ly / (float)ny;
    const float lambda = 0.001f;
    float dt = 0.01f/lambda/(1.0f/(dx*dx)/12.f + 1.0f/(dy*dy)/12.f);
    float dtdx = dt*lambda/dx/dx/12.f;
    float dtdy = dt*lambda/dy/dy/12.f;
    int nxt = nx + 4;   // two ghost cells on each side
    int nyt = ny + 4;
    float* T    = new float[nxt*nyt];
    float* Told = new float[nxt*nyt];

    // Initial condition: a Gaussian bump centred in the domain.
    for (int i = 0; i < nxt; i++) {
        for (int j = 0; j < nyt; j++) {
            int ind = i*nyt + j;
            T[ind] = 1.0f + exp(-32.f*(pow((float)(j-nyt/2)/(float)nyt,2)
                                     + pow((float)(i-nxt/2)/(float)nxt,2)));
            Told[ind] = T[ind];
        }
    }

    for (int step = 0; step < 1000; step++) {
        // Fourth-order central-difference stencil in both directions.
        for (int i = 2; i < nxt-2; i++) {
            for (int j = 2; j < nyt-2; j++) {
                int ind = i*nyt + j;
                T[ind] = Told[ind]
                    + dtdx * (     -Told[(i-2)*nyt + j]
                              +16.f*Told[(i-1)*nyt + j]
                              -30.f*Told[ i   *nyt + j]
                              +16.f*Told[(i+1)*nyt + j]
                                   -Told[(i+2)*nyt + j])
                    + dtdy * (     -Told[i*nyt + (j-2)]
                              +16.f*Told[i*nyt + (j-1)]
                              -30.f*Told[i*nyt +  j   ]
                              +16.f*Told[i*nyt + (j+1)]
                                   -Told[i*nyt + (j+2)]);
            }
        }
        // Copy T back into Told for the next step.
        for (int i = 0; i < nxt*nyt; i++) {
            Told[i] = T[i];
        }
    }

    float sum = 0.0f;
    for (int i = 0; i < nxt*nyt; i++) {
        sum += T[i];
    }
    std::cout << sum/(float)(nxt*nyt) << std::endl;

    delete[] T;
    delete[] Told;
    return 0;
}
```

After 1000 "time" iterations, the code should print 1.08712 (with the -O0 or -O1 compile flag).

Compiled with -O2 and without any parallelism (no OpenMP), it prints the wrong result, 1.09783:

/opt/intel/bin/icpc -O2 main.cpp && time ./a.out

LLVM C++ (Apple) and GNU g++ give the correct result (1.08712) even with the -O2 or -O3 optimization flag.

It seems that the Intel compiler at -O2 or -O3 performs aggressive optimizations on floating-point operations. To get the correct result, the -fp-model precise flag has to be added to the command line.

Why does the -O2 flag enable such aggressive optimizations, while GNU and LLVM do not? I know -O3 can be a dangerous flag to use, but I thought -O2 was, at least, safe. If you are not aware of the -fp-model flag, your results may be completely wrong...

Thank you.

5 Replies


TimP

Black Belt


03-12-2017
06:10 PM


You may need to define sum as double to find out whether the batched (re-associated) sum is actually the more accurate one.


jimdempseyatthecove

Black Belt


03-15-2017
05:15 AM


Try: -no-fast-transcendentals

It should then use higher-precision functions for exp and pow.

Note, -fp-model strict may (should) imply -no-fast-transcendentals, but I cannot attest to this.

Jim Dempsey


>>...In order to get the good result, -fp-model precise flag needs to be added to the command line.
Options **-fp-model precise** and, for example, **-fp-model fast=2** are not identical, and the results will be different.
>>...If you are not aware of the -fp-model flag, your results may be completely wrong...
A developer should know that the results could be less accurate ( not completely wrong... ). Also, as **Tim P** mentioned, saving intermediate results of calculations in variable(s) of double-precision type(s) always improves the accuracy of computations.
>>...I've encountered some dangerous behavior using the "Intel ICPC 17.0.2 20170213"
It is **expected behavior**, because the different FPU modes selected by the **-fp-model** options don't guarantee identical results of computations.

SKost

Valued Contributor II


03-15-2017
10:10 AM



TimP

Black Belt


03-15-2017
10:26 AM


As Jim implied, the fast-transcendentals option (short-vector math functions) may reduce the precision of pow() by as much as 4 ulp, but that should show up only beyond the 14th decimal place. fast-transcendentals is implied by -fp-model fast. I would be surprised if icpc took advantage of C++ rules to replace pow() and exp() with powf() and expf(), which could show differences in the 5th decimal. I do find the various C++ standards and practices confusing on this point.

The varying precision of the float sum reduction with order of evaluation is entirely likely to explain the observed differences. The strict left-to-right sum required by the language rule against re-association is typically the least accurate ordering, and that strict ordering is clearly not what -fp-model fast gives you.

You could check the differences due to vectorization one loop at a time with #pragma novector. g++ will not vectorize any of this without at least -ffast-math, which corresponds loosely to the icpc option -fp-model fast.


Here are notes I've recently made when investigating some accuracy issues.
...
// - If -fp-model fast=2 is used for CMMA, accuracy is affected. For example,
// **[ SP FP Processing - 8192 x 8192 - LPS=ijk - CS=ij:ik:jk ]**
// ...
// Matrix C first four elements:
// **8192.099609 8192.049805 8192.099609 8192.049805** **( -fp-model fast=2 )**
// vs.
// **8192.003906 8192.003906 8192.003906 8192.003906** **( -fp-model precise )**
// vs.
// accuracy also improves when an intermediate variable **sum** is
// declared as DP FP type ( double ):
//
// **8192.082031 8192.082031 8192.082031 8192.082031**
//
// **Compiled**:
// icpc -O3 -xMIC-AVX512 -qopenmp -mkl -fp-model fast=2 -fma -unroll=4 test14.c -o test14.out
//
// **Abbreviations**:
// MKL - Math Kernel Library
// CBLAS - C Basic Linear Algebra Subprograms
// CMMA - Classic Matrix Multiplication Algorithm
// LPS - Loop Processing Schema
// CS - Compute Schema
// FP - Floating Point
// SP - Single Precision
// DP - Double Precision
...

SKost

Valued Contributor II


03-15-2017
12:17 PM

