Intel compiler updates - one step forward - three steps back

Petr_K_ · ‎07-15-2012

I work for a major CFD vendor, and recent compiler updates have us looking for an alternative compiler as fast as we can. Almost every update brings in new bugs. These range from internal compiler errors, which can mostly be handled by slightly changing the code and which at least give a clear indication of what is going wrong, to the worst possible thing: miscompiled code.

Our software has over two million lines of complex C++ code. The recent Intel 12.1 has bugs that are really bad and extremely hard to find because they only occur when /O3 is used. Disabling optimization fixes the problem, but generates impossibly slow code.

The most recent bug miscompiles the following trivial code:

double const c0 = cos(a(0)), s0 = sin(a(0));
double const c1 = cos(a(1)), s1 = sin(a(1));
double const c2 = cos(a(2)), s2 = sin(a(2));
std::cout << a << c0 << " " << c1 << " " << c2 << " " << s0 << " " << s1 << " " << s2 << std::endl;

The a is a structure of three numbers accessed via a simple inline function

double const& operator()(int const& i) const { return _data[i]; }

The output the compiler produces in this case with full optimization is

(0.242077 0 0)0.970842 0.970842 0.970842 0.23972 0.23972 0.23972

With global optimization disabled (via #pragma optimize("g",off)) the same fragment produces correct output

(0.242077 0 0)0.970842 1 1 0.23972 0 0

So there is clearly some aliasing logic completely wrong and ignoring anything but the first variable.

Should I mention that these bugs are extremely time consuming to find (weeks, to be precise) even with a very detailed test suite?

Optimization flags used:

/O3 /Qtemplate-depth-100 /GR /EHsc -QaxAVX

JenniferJ · ‎07-16-2012

with minor update, this should not happen.

what is the icl version you're using? just type "icl", it will print out the sign-on signature.

what is the test data for the a variable? I'd like to try. (if you could attach a small testcase, it would be nice.)
have you tried /fp:precise? it may generate slower-code though.

do you have VS2008 or VS2010?

btw, the latest icl update is the update 11: Intel C++ Intel 64 Compiler XE for applications running on Intel 64, Version 12.1.5.344 Build 20120612

Jennifer

Petr_K_ · ‎07-16-2012

Jennifer,

I already filed a bug with Intel Premier Support (#679279) - they have all the test data and are investigating. This is not a floating point problem. It seems some incorrect static flow analysis is going on and registers are re-used or aliased incorrectly. The bug does not depend on the value of the floating point data - it is the indexes that are not interpreted correctly.

Petr

SergeyKostrov · ‎07-18-2012

Quoting Petr Kodl

...This is not a floating point problem. It seems some incorrect static flow analysis is going on and registers are re-used or aliased incorrectly. The bug does not depend on the value of the floating point data - it is the indexes that are not interpreted correctly.

Petr,

A while ago there was some issue with trigonometric functions ( sine or cosine ) and it is possible that your problem is
related to that. I'll try to follow up with more technical details some time later.

Best regards,
Sergey

SergeyKostrov · ‎07-18-2012

Quoting Petr Kodl

...The most recent bug miscompiles following trivial code:

double const c0 = cos(a(0)), s0 = sin(a(0));

double const c1 = cos(a(1)), s1 = sin(a(1));

double const c2 = cos(a(2)), s2 = sin(a(2));

std::cout << a << c0 << " " << c1 << " " << c2 << " " << s0 << " " << s1 << " " << s2 << std::endl;

The a is a structure of three numbers accessed via a simple inline function double const& operator()(int const& i) const { return _data[i]; }

The output the compiler produces in this case with full optimization is

(0.242077 0 0)

0.970842 0.970842 0.970842 0.23972 0.23972 0.23972

With global optimization disabled (via #pragma optimize("g",off))

the same fragment produces correct output

(0.242077 0 0)

0.970842 1 1 0.23972 0 0

[SergeyK] There is an inconsistency between the number of variables in the 'std::cout' statement and both outputs.
Simply count the variables and the numbers in the outputs.

So there is clearly some aliasing logic completely wrong and ignoring anything but the first variable.

Should I mention that these bugs are extremely time consuming to find (weeks to be precise) even with very detailed test suite.

Optimization flags used:

/O3 /Qtemplate-depth-100 /GR /EHsc -QaxAVX

I don't see "an aliasing problem"; I rather see a rounding problem. Please take a look at the re-formatted outputs:

[cpp]
The output the compiler produces in this case with full optimization is
( 0.242077 0 0 )
0.970842 0.970842 0.970842 0.23972 0.23972 0.23972

With global optimization disabled (via #pragma optimize("g",off)) the same fragment produces correct output
( 0.242077 0 0 )
0.970842 1 1 0.23972 0 0
[/cpp]

But I'm not trying to defend the Intel C++ compiler, because in your case something is wrong.

I was monitoring Intel C++ compiler forum for about 6 months ( 11.2011 - 04.2012 ) in order to get some
information on different problems and bugs. I was able to see that in some simple test-cases Intel C++ compiler was
failing and I didn't like it. However, when a real integration of Intel C++ compiler for the project started
in April everything was really smooth and it was completed in about 2 weeks.

Do you think that a "threat" to stop using the Intel C++ compiler is right? If you haven't used another C++ compiler on
your project from the beginning, a port of the two-million-line project could be another disaster. Isn't that true?

Petr_K_ · ‎07-20-2012

Hardly a rounding problem - let's make it even simpler

--- code begins ----

a = Vector<3,double>(0,1,2);

double const c0 = cos(a(0)), s0 = sin(a(0));
double const c1 = cos(a(1)), s1 = sin(a(1));
double const c2 = cos(a(2)), s2 = sin(a(2));

std::cout << c0 << " " << c1 << " " << c2 << " " << s0 << " " << s1 << " " << s2 << std::endl;

---- code ends ----

so we expect six distinct numbers, correct?

optimization on

1 1 1 0 0 0

optimization off (#pragma optimize("g",off))

1 0.540302 -0.416147 0 0.841471 0.909297

So there is clearly something very bad going on. Unlike most other cases I have already reported, this one does not even involve loop vectorization, which is where most of the regressions tend to happen.

I have submitted five regression reports so far against version 12 (new bugs not present in version 11.1.065), and I think a couple more are on the way. Progress is unfortunately slow because it involves multiple runs of complex test cases with selective recompilation and optimization disabling.

Petr
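The selective disabling described in this post can be sketched roughly as follows. This is only an illustration, not the actual build setup: the names `Trig`, `rotation_terms` and `check_rotation_terms` are hypothetical, the pragma is the form quoted in this thread, and the `_MSC_VER` guard is an assumption so that other compilers simply skip it.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the six values computed in the thread's example.
struct Trig {
    double c[3];
    double s[3];
};

// Bracket only the suspect function: global optimizations are switched off
// here and back on afterwards, so the rest of the file keeps its /O3 setting.
// The guard assumes a Microsoft-compatible front end; elsewhere the pragmas
// are left out entirely.
#if defined(_MSC_VER)
#pragma optimize("g", off)
#endif
Trig rotation_terms(const double (&a)[3])
{
    Trig t;
    for (int i = 0; i < 3; ++i) {
        t.c[i] = std::cos(a[i]);
        t.s[i] = std::sin(a[i]);
    }
    return t;
}
#if defined(_MSC_VER)
#pragma optimize("g", on)
#endif

// Cheap guard to rerun after each selective rebuild while narrowing down
// which translation unit or function is affected.
void check_rotation_terms()
{
    const double a[3] = {0.0, 1.0, 2.0};
    const Trig t = rotation_terms(a);
    assert(t.c[0] == 1.0 && t.s[0] == 0.0);        // cos(0) and sin(0)
    assert(t.c[1] != t.c[0] && t.c[2] != t.c[1]);  // the three inputs differ
}
```

Moving the pragma pair from function to function while rerunning such a check is one way to bisect a miscompile without giving up optimization for the whole file.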

Petr_K_ · ‎07-20-2012

In order to keep it even simpler and remove complexity

---- code begin ----

class Vec
{
    double _data[3];
public:
    double const& operator()(unsigned int const& i) const { return _data[i]; }
    Vec(double const& a, double const& b, double const& c)
    {
        _data[0] = a;
        _data[1] = b;
        _data[2] = c;
    }
};

Vec b(0,1,2);
double const c0 = cos(b(0)), s0 = sin(b(0));
double const c1 = cos(b(1)), s1 = sin(b(1));
double const c2 = cos(b(2)), s2 = sin(b(2));
std::cout << c0 << " " << c1 << " " << c2 << " " << s0 << " " << s1 << " " << s2 << std::endl;

----- code end -----

also reproduces the problem when compiled with

icl -O3 /MD /Zi /GR /EHsc -QaxAVX

running on Xeon 5690

unless the #pragma optimize("g",off) is used

so yes - quite disturbing bug
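A self-checking variant of the fragment above might look like this. It is a sketch only: `count_mismatches` and `require_correct_codegen` are hypothetical names, and the reference values are the unoptimized output quoted earlier in this thread, compared with a tolerance that matches its six printed digits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

class Vec
{
    double _data[3];
public:
    double const& operator()(unsigned int const& i) const { return _data[i]; }
    Vec(double const& a, double const& b, double const& c)
    {
        _data[0] = a;
        _data[1] = b;
        _data[2] = c;
    }
};

// Returns how many of the six values differ from the unoptimized output
// quoted in the thread: 1 0.540302 -0.416147 0 0.841471 0.909297
int count_mismatches()
{
    Vec b(0, 1, 2);
    double const c0 = std::cos(b(0)), s0 = std::sin(b(0));
    double const c1 = std::cos(b(1)), s1 = std::sin(b(1));
    double const c2 = std::cos(b(2)), s2 = std::sin(b(2));

    double const got[6]  = { c0, c1, c2, s0, s1, s2 };
    double const want[6] = { 1.0, 0.540302, -0.416147, 0.0, 0.841471, 0.909297 };

    int bad = 0;
    for (int i = 0; i < 6; ++i) {
        if (std::fabs(got[i] - want[i]) > 1e-6) {
            std::printf("value %d: got %g, expected %g\n", i, got[i], want[i]);
            ++bad;
        }
    }
    return bad;
}

// Startup guard; note that assert() compiles away when NDEBUG is defined.
void require_correct_codegen()
{
    assert(count_mismatches() == 0);
}
```

With the miscompiled output reported above (1 1 1 0 0 0), four of the six comparisons would fail, so a check of this kind turns a silent wrong answer into a visible one.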

JenniferJ · ‎07-20-2012

Quoting Petr Kodl

I submitted five regression reports so far against version 12 - new bugs not present in version 11.1.065 and I think couple more are on the way. The progress is unfortunately slow because it involves multiple runs of complex test cases and selective recompilation and optimization disabling

Petr

Could you let me know the five ticket numbers?

And thanks for the small testcase. We're investigating this issue right now.

Jennifer

JenniferJ · ‎07-20-2012

Strange, I'm not seeing the issue when running on a Sandy Bridge.

C:\Jennifer\Issues\AVX>icl -O3 /MD /Zi /GR /EHsc -QxAVX avx.cpp
Intel C++ Intel 64 Compiler XE for applications running on Intel 64, Version 12.1.5.344 Build 20120612
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
avx.cpp
Microsoft Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
-out:avx.exe
-debug
-pdb:avx.pdb
avx.obj
C:\Jennifer\Issues\AVX>avx
1 0.540302 -0.416147 0 0.841471 0.909297
C:\Jennifer\Issues\AVX>type avx.cpp
#include <iostream>
using namespace std;
class Vec
{
double _data[3];
public:
double const& operator()(unsigned int const& i) const { return _data[i]; }
Vec(double const& a, double const& b, double const& c)
{
_data[0] = a;
_data[1] = b;
_data[2] = c;
}
};
int main(int argc, char* argv[])
{
Vec b(0,1,2);
double const c0 = cos(b(0)), s0 = sin(b(0));
double const c1 = cos(b(1)), s1 = sin(b(1));
double const c2 = cos(b(2)), s2 = sin(b(2));
std::cout << c0 << " " << c1 << " " << c2 << " " << s0 << " " << s1 << " " << s2 << std::endl;
return 0;
}
C:\Jennifer\Issues\AVX>

Petr_K_ · ‎07-20-2012

interesting - can you attach your exe and see if there is a difference in the code - I ran the code again and it prints

1 1 1 0 0 0

Petr

the ticket number is

679279

I included only the flags that should be affecting optimizations, but maybe the full set could make a difference:

icl -DNDEBUG /DWIN32 /D_WINDOWS /DNOMINMAX /D_MBCS /D_CRT_SECURE_NO_DEPRECATE /D_CRT_NONSTDC_NO_DEPRECATE /D_HAS_ITERATOR_DEBUGGING=0 /D_SECURE_SCL=0 /Qvc10 /D_AMD64_ /DWIN64 /D_WIN64

-O3 /Qeffc++ /Qtemplate-depth-100 /MD /Zi /nologo /GR /EHsc /Qoption,cpp,--suppress_base_class_export /Qwd1478 /Qwd1740 /Qwd1744 /Qvec-report0 /Qwd2012 /Qwd2013 /Qwd2014 /Qwd2015 /Qwd2017

/Qwd2021 /Qwd2022 /Qwd2025 /Qwd2027 /Qwd2047 /vmg /Qwd2304 /Qwd2305 /Qdiag-error:117 /Qdiag-error:1011 /Qdiag-warning:177,593 /Qdiag-error:409 /Qwd241

main.cpp

Petr_K_ · ‎07-20-2012

I added a test file that triggers the error. The strange thing is that if you comment out the marked line, which is not even executed until after the test runs, something changes and it works fine, so it looks like the minimum code size is involved somehow.

just in case the attachments do not work:

---- code begin -----

# include <iostream>

static int
findArg(char const *arg, int argc, char *argv[])
{
    for (int i = 1; i < argc; ++i)
        if (strcmp(argv[i], arg) == 0)
            return i;
    return 0;
}

class Vec
{
    double _data[3];
public:
    double const& operator()(unsigned int const& i) const { return _data[i]; }
    Vec(double const& a, double const& b, double const& c)
    {
        _data[0] = a;
        _data[1] = b;
        _data[2] = c;
    }
};

int test()
{
    Vec b(0,1,2);
    double const c0 = cos(b(0)), s0 = sin(b(0));
    double const c1 = cos(b(1)), s1 = sin(b(1));
    double const c2 = cos(b(2)), s2 = sin(b(2));
    std::cout << c0 << " " << c1 << " " << c2 << " " << s0 << " " << s1 << " " << s2 << std::endl;
    return 0;
}

int
main(int argc, char* argv[])
{
    test();
    // comment out the next line and the error stops - this affects the result of the test() call for some reason
    bool deleteOnExit = findArg("-deleteOnExit", argc, argv);
    return 0;
}

JenniferJ · ‎07-20-2012

can you attach the new test file?

Jennifer

Petr_K_ · ‎07-20-2012

I added it as text - tried to upload it, but attachments did not show up.

JenniferJ · ‎07-20-2012

Yeah, the additional code brought in a lot of stuff from the header. Now I can duplicate the issue on the Sandy Bridge, and am sending it to compiler engineering right now.

I'll update you when there is any progress. Thanks for the testcase again.

Jennifer

SergeyKostrov · ‎07-21-2012

Quoting Sergey Kostrov

Quoting Petr Kodl
...This is not a floating point problem. It seems some incorrect static flow analysis is going on and registers are re-used or aliased incorrectly. The bug does not depend on the value of the floating point data - it is the indexes that are not interpreted correctly.

Petr,

A while ago there was some issue with trigonometric functions ( sine or cosine ) and it is possible that your problem is
related to that. I'll try to follow up with more technical details some time later...

This is a test-case that was used to reproduce the problem. Please take a look:

[cpp]
...
    // Sub-Test 26 - Sign problem with 'sin' CRT-function
    {
    ///*
        CrtPrintf( RTU("Sub-Test 26\n") );

        _RTALIGN16 RTfloat fInpVal[12] = { 0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 2.5f, 3.0f, 3.5f, 4.0f, 4.5f, 5.0f, 5.5f };
        _RTALIGN16 RTfloat fOutVal[12] = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };

        // A set of values from Windows Calc.exe rounded to 16 digits after the point ( as floats )
        _RTALIGN16 RTfloat fChkVal[12] =
        {                           // Radians
             0.0000000000000000f,   // 0.0
             0.4794255386042030f,   // 0.5
             0.8414709848078965f,   // 1.0
             0.9974949866040544f,   // 1.5
             0.9092974268256817f,   // 2.0
             0.5984721441039565f,   // 2.5
             0.1411200080598672f,   // 3.0
            -0.3507832276896199f,   // 3.5
            -0.7568024953079283f,   // 4.0
            -0.9775301176650971f,   // 4.5
            -0.9589242746631385f,   // 5.0
            -0.7055403255703919f    // 5.5
        };

        // A set of values with sign errors ( as floats )
        _RTALIGN16 RTfloat fInvVal[12] =
        {                           // Radians
             0.0000000000000000f,   // 0.0
             0.4794260000000000f,   // 0.5
             0.8414710000000000f,   // 1.0
             0.9974950000000000f,   // 1.5
            -0.9092970000000000f,   // 2.0 <- here the error starts
            -0.5984720000000000f,   // 2.5
            -0.1411200000000000f,   // 3.0
            -0.3507830000000000f,   // 3.5 <- here it's okay again
            -0.7568020000000000f,   // 4.0
            -0.9775300000000000f,   // 4.5
             0.9589240000000000f,   // 5.0 <- wrong again
             0.7055400000000000f    // 5.5
        };

        // A set of values with sign errors ( as strings )
        RTchar *szInvVal[12] =
        {                           // Radians
            " 0.0000000000000000",  // 0.0
            " 0.4794260000000000",  // 0.5
            " 0.8414710000000000",  // 1.0
            " 0.9974950000000000",  // 1.5
            "-0.9092970000000000",  // 2.0 <- here the error starts
            "-0.5984720000000000",  // 2.5
            "-0.1411200000000000",  // 3.0
            "-0.3507830000000000",  // 3.5 <- here it's okay again
            "-0.7568020000000000",  // 4.0
            "-0.9775300000000000",  // 4.5
            " 0.9589240000000000",  // 5.0 <- wrong again
            " 0.7055400000000000"   // 5.5
        };

        RTint i;

        for( i = 0; i < 12; i++ )
        {
        //  fOutVal[i] = CrtSin( fInpVal[i] );
            fOutVal[i] = SinNTS11( fInpVal[i], RTfalse );
        }

        for( i = 0; i < 12 ; i++ )
        {
            CrtPrintf( RTU("%2ld % 2f % .16f % .16f\tAbsError 1: % .10f\tAbsError 2: % .10f\n"),
                       ( RTint )i, fInpVal[i], fOutVal[i], fChkVal[i], ( fOutVal[i] - fChkVal[i] ), ( fInvVal[i] - fChkVal[i] ) );
        //  CrtPrintf( RTU("%2ld % 2f % .16f % .16f\n"),
        //             ( RTint )i, fInpVal[i], fOutVal[i], fChkVal[i] );
        //  CrtPrintf( RTU("%2ld % 2f % .16f\n"),
        //             ( RTint )i, fInpVal[i], fOutVal[i] );
        }
    */
    }
...
[/cpp]

>> Output <<
> Test1017 Start <
Sub-Test 26
0 0.000000 0.0000000000000000 0.0000000000000000 AbsError 1: 0.0000000000 AbsError 2: 0.0000000000
1 0.500000 0.4794255495071411 0.4794255495071411 AbsError 1: 0.0000000000 AbsError 2: 0.0000004470
2 1.000000 0.8414709568023682 0.8414709568023682 AbsError 1: 0.0000000000 AbsError 2: 0.0000000596
3 1.500000 0.9974949359893799 0.9974949955940247 AbsError 1: -0.0000000596 AbsError 2: 0.0000000000
4 2.000000 0.9092961549758911 0.9092974066734314 AbsError 1: -0.0000012517 AbsError 2: -1.8185943961
5 2.500000 0.5984489321708679 0.5984721183776856 AbsError 1: -0.0000231862 AbsError 2: -1.1969441175
6 3.000000 0.1408745944499970 0.1411200016736984 AbsError 1: -0.0002454072 AbsError 2: -0.2822400033
7 3.500000 -0.3525765836238861 -0.3507832288742065 AbsError 1: -0.0017933547 AbsError 2: 0.0000002384
8 4.000000 -0.7668045759201050 -0.7568024992942810 AbsError 1: -0.0100020766 AbsError 2: 0.0000004768
9 4.500000 -1.0228917598724365 -0.9775301218032837 AbsError 1: -0.0453616381 AbsError 2: 0.0000001192
10 5.000000 -1.1336172819137573 -0.9589242935180664 AbsError 1: -0.1746929884 AbsError 2: 1.9178482890
11 5.500000 -1.2947624921798706 -0.7055402994155884 AbsError 1: -0.5892221928 AbsError 2: 1.4110803008
> Test1017 End <
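The growth of the "AbsError 1" column with the argument is what one would expect from a truncated Taylor polynomial. The source of `SinNTS11` is not shown in this thread, so the sketch below is only an assumption: it evaluates the sine series up to and including the x^11 term and measures the difference against the library sine. The names `taylor_sin_deg11`, `taylor_error` and `check_small_argument` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Evaluates x - x^3/3! + x^5/5! - ... up to and including the x^11 term.
double taylor_sin_deg11(double x)
{
    double term = x;   // current term, starting with x^1/1!
    double sum  = x;
    for (int n = 3; n <= 11; n += 2) {
        // Next term: multiply by -x^2 / ((n-1) * n)
        term *= -x * x / (double(n - 1) * double(n));
        sum  += term;
    }
    return sum;
}

// Truncation error of the polynomial relative to the library sine.
double taylor_error(double x)
{
    return taylor_sin_deg11(x) - std::sin(x);
}

// Close to zero the omitted x^13/13! term is negligible.
void check_small_argument()
{
    assert(std::fabs(taylor_error(0.5)) < 1e-12);
}
```

The first omitted term is x^13/13!, which is tiny near zero but grows quickly with the argument, so without a range reduction step the error increases steadily toward the upper end of the tested range.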

Bernard · ‎07-22-2012

A while ago there was some issue with trigonometric functions ( sine

If I remember correctly, the problem was related to the built-in x87 fsin instruction. The MSVCRT sin() function calls x87 fsin to calculate the values of the sine function.
The problem was centered around the fsin range reduction algorithm, which used a 53-bit precision approximation to the value of Pi. Because the infinite-precision transcendental Pi was approximated by a 53-bit value, reducing the argument by whole periods of the sine function probably induced some error, which manifested itself as a shift from the true value.

Btw Java.Math class does range reduction properly and feeds fsin with the reduced [-Pi/4,Pi/4] range values.

TimP · ‎07-22-2012

x87 fsin uses the 64-bit (80-bit long double) precision approximation for Pi. It's advertised as 66-bit accurate, as the next 2 bits happen to be 0. fsin is not affected by precision mode. fsin itself skips the range reduction if the argument magnitude exceeds 2^63. In most applications, such a large argument would indicate a failure of some kind has already occurred.
A work-around was to reduce range by an exact multiple of the long double approximation to Pi, which is supported by x87 remaindering, but doesn't improve accuracy, except in the sense of producing a bounded value for the result. Naturally, people ran into surprises with extreme values when they switched between fsin and a software library range reduction.
You may be speaking of a particular implementation of Java, although of course Java went to some lengths to avoid numerical differences among implementations.
Modern commercial compilers like Intel's avoid fsin entirely, unless you are compiling for 32-bit mode and specifically request x87 code. The time required to call fsin is unacceptable in many applications.
IBM compilers are advertised as using a fully accurate algorithm for trigonometric range reduction, where I believe the run time becomes quite large for huge argument magnitudes.
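The effect of the Pi approximation on range reduction can be illustrated with a small sketch. It does not reproduce fsin or any particular library; it reduces the argument with 2*Pi rounded to double and compares the result against `std::sin`, assuming the library performs a more careful reduction, as common implementations do. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Naive reduction: remove whole periods using 2*Pi rounded to double.
// std::fmod itself is exact, so the only error source is the rounded period.
double sin_naive_reduction(double x)
{
    const double two_pi = 6.283185307179586476925286766559;
    return std::sin(std::fmod(x, two_pi));
}

// Difference between the naive path and the library sine.
double reduction_error(double x)
{
    return std::fabs(sin_naive_reduction(x) - std::sin(x));
}

// Arguments below one period pass through fmod unchanged.
void check_small_argument()
{
    assert(reduction_error(1.0) <= 1e-15);
}
```

Each removed period carries the rounding error of the stored 2*Pi, so the error in the reduced argument scales with the number of periods removed. This is why results can change noticeably for large arguments when switching between reduction methods.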

Petr_K_ · ‎07-22-2012

I think the basic premise here is that this is a floating point problem - I do not think it is. The problem is present regardless of the values assigned to the initial vector. The floating point just helps to trigger it because some advanced optimizations kick in.

From my brief analysis it seems to have something to do with overloaded inlined operator() - when the vector class is replaced with POD array[] it goes away. So the compiler seems to be confused by the overloaded () and decides to use the first value regardless of the index.

I'll wait what Intel compiler team will come up with.

Petr

Bernard · ‎07-22-2012

>>You may be speaking of a particular implementation of Java, although of course Java went to some lengths to avoid numerical differences among implementations

I'm talking about Java 5 or 6. The programmers who wrote the Java.Math class got unsatisfactory results (a very high absolute error of 5-6 decimal places) when Math.sin() was binary translated by the JIT compiler to fsin. The problem was the range reduction implemented by fsin. In order to achieve more accurate results, a software range-reduction algorithm was developed. AFAIK every Java implementation from version 4 or 5 uses the Java StrictMath library for sine calculation, which in turn is based on the FDLIBM 5 implementation.

>>The time required to call fsin is unacceptable in many applications

VML sin() can achieve on average ~21 cycles per element for a randomly chosen 1000-element vector, so it is 2x faster than scalar fsin. It would be interesting to know how the range reduction is implemented and how Pi is approximated, that is, at what precision.

SergeyKostrov · ‎07-22-2012

Quoting iliyapolak

A while ago there was some issue with trigonometric functions ( sine
If I remember the problem was related to the built-in x87 fsin instruction...

I've found a thread and please take a look:

Vectorization of sin/cos results in wrong values
February 8th, 2012
http://software.intel.com/en-us/forums/showthread.php?t=102930&o=a&s=lr

Posts #34, #39 and #40 are final statements from Intel Software Engineers regarding the problem.

Best regards,
Sergey

Bernard · ‎07-22-2012

Vectorization of sin/cos results in wrong values
February 8th, 2012
http://software.intel.com/en-us/forums/showthread.php?t=102930&o=a&s=lr

Thanks for posting this link it was a great read.

Btw the subject of this discussion is probably VML sin function.
The Taylor series for the sine function will return an accurate result up to 3 radians. From a mathematical point of view it converges everywhere, but when executed on a digital computer the upper bound is 3.
