Hello,
While comparing parallelism between OpenMP, MPI, CUDA and OpenACC, I encountered some dangerous behavior with Intel ICPC 17.0.2 20170213. The code below solves the classical 2D temperature distribution problem (just as a test):
```cpp
// /opt/intel/bin/icpc -O2 main.cpp && time ./a.out
#include <iostream>
#include <cmath>

int main() {
    const int nx = 2800;
    const int ny = 2800;
    const float lx = 1.f;
    const float ly = 1.f;
    float dx = lx / (float)nx;
    float dy = ly / (float)ny;
    const float lambda = 0.001f;
    float dt   = 0.01f/lambda/(1.0f/(dx*dx)/12.f + 1.0f/(dy*dy)/12.f);
    float dtdx = dt*lambda/dx/dx/12.f;
    float dtdy = dt*lambda/dy/dy/12.f;
    int nxt = nx + 4;   // 2 ghost cells on each side for the 4th-order stencil
    int nyt = ny + 4;
    float* T    = new float[nxt*nyt];
    float* Told = new float[nxt*nyt];

    // initial condition: a Gaussian bump centered in the domain
    for (int i = 0; i < nxt; i++) {
        for (int j = 0; j < nyt; j++) {
            int ind = i*nyt + j;
            T[ind] = 1.0f + exp(-32.f*(pow((float)(j-nyt/2)/(float)nyt,2)
                                     + pow((float)(i-nxt/2)/(float)nxt,2)));
            Told[ind] = T[ind];
        }
    }

    for (int step = 0; step < 1000; step++) {
        for (int i = 2; i < nxt-2; i++) {
            for (int j = 2; j < nyt-2; j++) {
                int ind = i*nyt + j;
                T[ind] = Told[ind]
                    + dtdx * (-Told[(i-2)*nyt + j] + 16.f*Told[(i-1)*nyt + j]
                              - 30.f*Told[i*nyt + j]
                              + 16.f*Told[(i+1)*nyt + j] - Told[(i+2)*nyt + j])
                    + dtdy * (-Told[i*nyt + (j-2)] + 16.f*Told[i*nyt + (j-1)]
                              - 30.f*Told[i*nyt + j]
                              + 16.f*Told[i*nyt + (j+1)] - Told[i*nyt + (j+2)]);
            }
        }
        for (int i = 0; i < nxt*nyt; i++) {
            Told[i] = T[i];   // copy T back for the next time step
        }
    }

    float sum = 0.0f;
    for (int i = 0; i < nxt*nyt; i++) {
        sum += T[i];
    }
    std::cout << sum/(float)(nxt*nyt) << std::endl;

    delete[] T;
    delete[] Told;
    return 0;
}
```
After 1000 "time" iterations, the code is supposed to give the result 1.08712 (using the -O0 or -O1 compile flag).
Without any parallelism (no OpenMP), the code compiled with -O2 optimization gives the following (wrong) result: 1.09783
/opt/intel/bin/icpc -O2 main.cpp && time ./a.out
Using Apple LLVM C++ or GNU g++ with the -O2 or even -O3 optimization flag gives the correct result (1.08712).
It seems that the Intel compiler, with the -O2 or -O3 compile flag, performs aggressive optimizations on floating-point data. To get the correct result, the -fp-model precise flag needs to be added to the command line.
Why does the -O2 flag enable such aggressive optimizations (while GNU or LLVM does not)? I mean, I know that -O3 can be a dangerous flag to use, but I thought -O2 was at least safe. If you are not aware of the -fp-model flag, your results may be completely wrong...
Thank you.
You may need to define sum as double to find out whether the batched (vectorized) sum is actually the more accurate one.
Try: -no-fast-transcendentals
This should then use higher-precision versions of exp and pow.
Note, -fp-model strict may (should) imply -no-fast-transcendentals, but I cannot attest to this.
Jim Dempsey
As Jim implied, the fast-transcendentals option (short vector math functions) may reduce precision of pow() by as much as 4 Ulp, but that should show up only beyond 14th decimal place. fast-transcendentals is implied by fp-model fast. I would be surprised if icpc took advantage of C++ to replace pow() and exp() by powf() and expf(), which might show differences in the 5th decimal. I do find the various C++ standards and practices confusing on this point.
The varying precision of the float sum reduction with order of evaluation is entirely likely to explain the observed differences. Note that strict application of the language rule against re-association (i.e. the order you get without -fp-model fast) can actually produce the least accurate result for a long float sum.
You could check the differences due to vectorization one loop at a time with #pragma novector. g++ will not vectorize any of this without at least -ffast-math, loosely corresponding to the icpc option -fp-model fast.