C++ -fast option causes SIGSEGV but no other optimization flags do

Len_Testa · ‎01-12-2012

I have a C++ program compiled with version 12.0.4 and running in a VMWare Ubuntu 11.10 environment, which produces a SIGSEGV error only when compiled with the -fast option. I'm looking for help in trying to determine whether this is a code or compiler issue.

I have a specific set of input which produces this SIGSEGV only when compiled with the -fast option. Code compiled with other optimization flags (-xK or -xW or -xP or -xN, or no flags) does not produce the SIGSEGV.

When run under Valgrind, the code generates no warnings or errors.

When I try to run the -fast code in the Intel debugger, I get the following message after the SIGSEGV:

"This is an unexpected condition and may indicate the presence of a defect."

Through manual debugging, I've got the issue narrowed down to where a particular method call seems to overwrite the metadata in a C++ vector, causing the runtime information about the length of the vector to be clobbered. However, the method doesn't use or manipulate the C++ vector. I think they just happen to be near each other in memory.

Normally I'd keep debugging into that method call. However, Valgrind doesn't report errors and the debugger rolls over and dies. The method is fairly complicated and runs for a long time, so debugging via print statements isn't feasible.

Any other advice on how to figure this out or work around it?

Thanks!

Len

Georg_Z_Intel · ‎01-12-2012

Hello Len,

Reading your description, I take it that it's not possible to get a (simple) reproducer, is it?

What I'm doing in cases like that is compiling the application with debug information (-g) first. Does the error occur there, too? If yes, you can use the application with the following steps. Otherwise it's getting really tricky...
Next, I'm starting IDB (Intel Debugger) and let the application run till the SIGSEGV occurs. Then you get the full context and mostly see directly what's wrong.
However, I also had cases where errors were more complex and data races or memory leaks could be the reason. Therefore I'm using tools like Intel Inspector XE 2011 to exclude those problems, mostly before digging down deeper. You said you used Valgrind - is your application multi-threaded? If yes, Valgrind is not sufficient (data races are not detected, for example).

To develop an understanding of the error itself: Does it always occur in the same context? If not, then it's likely a "Heisenbug" resulting from uninitialized memory or data races.

If you were not able to directly deduce the reason of the error (following the steps above) it's getting complex and it's hard to help you w/o detailed data. For identifying the problem one also needs a very good understanding of the application at hand... which I don't have.

You might do private replies when providing more (critical) data. In the end we can summarize publicly whether this is a compiler bug or not.
Would this be OK for you?

Thanks & best regards,

Georg Zitzlsberger

Len_Testa · ‎01-12-2012

Thank you, Georg.

Thankfully, the error happens when I use both the -g and -fast options, and the error always occurs in the same context. The entire application is single-threaded.

I've just tried running the code in the Intel Inspector XE 2011. The code runs for a bit and then I get the following error message:

Memory error analysis: started
Result file: [./r001mi3/r001mi3.inspxe]
Error: An internal error has occurred. Our apologies for this inconvenience. Please gather a description of the steps leading up to the problem and contact the Intel customer support team.
Application exit code: [255]

I get a similar kind of "unexpected error" message when I run the code in the Intel debugger, too. This makes it difficult to track down the error.

I also tried upgrading to 12.1.0, and the issue still appears.

I'm happy to provide the results file from Inspector, if needed.

Thanks!

Len

ETA: Info on the upgrade to 12.1.0 and threads.

Brandon_H_Intel · ‎01-12-2012

Hi Len,

On Linux, these are the options that -fast implies (and only these):

-ipo, -O3, -no-prec-div, -static, and -xHost

So I would recommend replacing -fast with this set of options, and then reducing the options until you have the minimum set that triggers the error. That will help us in diagnosing this and may give you better ideas for working around it when you see what options are actually at the bottom of this.

SergeyKostrov · ‎01-12-2012

Quoting Len Testa

I have a C++ program compiled with version 12.0.4 and running in a VMWare Ubuntu 11.10 environment, which produces a SIGSEGV error only when compiled with the -fast option... I'm looking for help in trying to determine whether this is a code or compiler issue.

[SergeyK] Please try to check your code and, for example, increase the size of ALL data
buffers (by 2x or 4x).
...
#define SIGSEGV 11 /* segment violation */
...

That is the name the C runtime uses for the signal. On Windows the same fault is reported as an
Access Violation with hex code 0xC0000005. Did you see it?

"This is an unexpected condition and may indicate the presence of a defect."

[SergeyK] I would check for a possible 'buffer overrun' case.

Through manual debugging, I've got the issue narrowed down to where a particular method call seems to overwrite the metadata in a C++ vector, causing the runtime information about the length of the vector to be clobbered. However, the method doesn't use or manipulate the C++ vector. I think they just happen to be near each other in memory.

[SergeyK] It looks like some code is trying to write outside of the allowed boundaries.
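
One way to test for an overrun like this without a debugger is to put sentinel values on either side of the suspect buffer and check them after the suspicious call. The following is only a minimal sketch: `GuardedBuffer`, `raw`, `guardsIntact` and `overrunIsDetected` are hypothetical names, and the guard slots live inside the same `std::array` so the simulated off-by-one write stays well-defined.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: a fixed-size buffer with one guard slot on each side.
// Logical index i maps to storage slot i + 1, so a write at logical index N
// (or -1) lands in a guard slot instead of in a neighbouring object.
template <std::size_t N>
class GuardedBuffer {
public:
    GuardedBuffer() {
        storage_.fill(0);
        storage_.front() = kGuard;
        storage_.back() = kGuard;
    }

    // Deliberately unchecked for 0..N-1, to mimic raw array indexing.
    // Only -1..N is accepted, which keeps every access inside storage_.
    std::int32_t& raw(std::ptrdiff_t i) {
        assert(i >= -1 && i <= static_cast<std::ptrdiff_t>(N));
        return storage_[static_cast<std::size_t>(i + 1)];
    }

    // True while both guard slots still hold the sentinel value.
    bool guardsIntact() const {
        return storage_.front() == kGuard && storage_.back() == kGuard;
    }

private:
    static constexpr std::int32_t kGuard = 0x5AFEC0DE;
    std::array<std::int32_t, N + 2> storage_;
};

// Simulates a loop that steps one past the end (<= instead of <) and
// reports whether the guard check notices it.
bool overrunIsDetected() {
    GuardedBuffer<4> buf;
    for (std::ptrdiff_t i = 0; i <= 4; ++i) {  // bug: should be i < 4
        buf.raw(i) = 7;
    }
    return !buf.guardsIntact();
}
```

Calling `guardsIntact()` before and after the suspect method narrows down which call does the damage, at the cost of a few extra words per buffer.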

Normally I'd keep debugging into that method call. However, Valgrind doesn't report errors and the debugger rolls over and dies.

[SergeyK] The stack was corrupted. I'm 100% sure because I had a similar case with the MS Debugger
on a Windows CE platform.

The method is fairly complicated and runs for a long time,

[SergeyK] Did you do any checks for memory leaks?
How much memory is allocated for ALL buffers?
Are there any checks on the pointers returned from calls to the malloc CRT function?

so debugging via print statements isn't feasible.

Any other advice on how to figure this out or work around it?

[SergeyK] I would try to comment out some blocks of code in order to find the root cause of the problem.
An ideal roadmap is as follows:

- comment out the call to your processing method
- test the code (with different optimizations)
- if it fails, contact Intel; if not, go to the next step
- uncomment a small block of code
- test the code (with different optimizations)
- if it fails, contact Intel; if not, go to the next step
- uncomment another small block of code
- and so on, until you find the root cause of the problem

Thanks!

Len

Best regards,
Sergey

Georg_Z_Intel · ‎01-13-2012

Hello Len,

I'm sorry to hear that our tools failed for you. On the other hand, I'm surprised about this because the two of them use different technology...
You said you did "manual debugging" before. Were you using GDB? Does debugging work with it or does it fail, too? If it's not working either, then the problem is system related and the underlying problem has merely been exposed because the compiler creates (valid) code differently.

Could you roughly describe your project? Is it built and run on the same system? What Linux version are you using and what's the underlying architecture?
Another thing to check would be to statically link the libraries that come with the compiler. If you build & execute the application in different environments different sets of compiler libraries might be used unintentionally. Just to be sure, use option -static-intel.
Could you also provide (privately?) the output of ldd tool for your application?

Thanks & best regards,

Georg Zitzlsberger

Len_Testa · ‎01-13-2012

Hi Brandon,

Ah, this was helpful. Thank you.

The issue occurs with the -O3 flag. If I remove only it and use "-ipo -no-prec-div -static -xHost" the code compiles and runs cleanly.

The code also compiles and runs cleanly if I use the -O2 flag, as in "-ipo -O2 -no-prec-div -static -xHost."

That's a decent workaround for me, since I get most of the performance improvements from the compiler. I'd still like to figure out the root cause, though.

Below is the CPU info from my machine, in case that helps. I'm compiling and running on the same machine and environment.

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel Core i7-2630QM CPU @ 2.00GHz
stepping : 7
cpu MHz : 1995.510
cache size : 6144 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida arat epb pln pts dts
bogomips : 3991.02
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

Thanks again,

Len

Brandon_H_Intel · ‎01-13-2012

I think at this point to pursue the compiler issue with -O3, I would really recommend submitting an issue through Premier Support at http://premier.intel.com. You should have the permissions to submit a compiler issue if you've registered your compiler product. If you don't, let me know.

If you can send us the code via Premier, then we can take it from there I think. If you can't, the next step is to determine which source file is actually causing the problem. You should be able to use binary deduction to determine this (build half the objects with -O2, half with -O3, see if there's a problem, and so on until you have one object built with -O3 and everything else with -O2 that exhibits the problem).
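
The halving procedure described above can be sketched as an ordinary binary search over the list of object files. This is only an illustration: `bisectCulprit` and the `CrashesWith` callback are hypothetical, the callback stands in for "rebuild these objects with -O3, everything else with -O2, and run the failing input", and the sketch assumes a single culprit and a deterministic crash.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical callback: rebuilds with the given objects at -O3 and all
// others at -O2, runs the failing input, and returns true if it crashes.
using CrashesWith = std::function<bool(const std::set<std::string>&)>;

// Narrows a list of object files down to the one whose -O3 build triggers
// the failure. Assumes exactly one culprit and a reproducible crash.
std::string bisectCulprit(const std::vector<std::string>& objects,
                          const CrashesWith& crashesWith) {
    assert(!objects.empty());
    std::size_t lo = 0;  // the culprit is somewhere in [lo, hi)
    std::size_t hi = objects.size();
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        const std::set<std::string> firstHalf(
            objects.begin() + static_cast<std::ptrdiff_t>(lo),
            objects.begin() + static_cast<std::ptrdiff_t>(mid));
        if (crashesWith(firstHalf)) {
            hi = mid;  // the crash follows the first half
        } else {
            lo = mid;  // otherwise the culprit is in the second half
        }
    }
    return objects[lo];
}
```

With N object files this needs roughly log2(N) rebuilds rather than N, which matters when each build uses -ipo.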

SergeyKostrov · ‎01-13-2012

Quoting Len Testa

...
That's a decent workaround for me, since I get most of the performance improvements from the compiler. I'd still like to figure out the root cause, though.
...

Take into account that only a very small set of CRT functions return a 'SIGSEGV' error or do any
checks for that error. Remember, this is not an exception (!): your operating system maps some
exceptions to signals. A memory Access Violation exception is mapped to the 'SIGSEGV' signal.

1.
If you have a complete set of sources for ALL CRT functions, I would try to build a static
multithreaded library of the CRT functions in a Debug configuration, and then I would build the
application against that static multithreaded library.

Then I would start the application in a debugger with a breakpoint on the 'signal' CRT function.

In that case, as soon as the 'SIGSEGV' error happens it should hit the breakpoint and you will be able to
see a call stack.

2.
You could also define your own C handler for processing all signal errors, like:

Note: This is an example of processing some FPU errors using a 'SignalHandler'.

...
jmp_buf jmpbuf; // Address for long jump to jump to

RTint g_iSignalCode = 0; // Signal Code
RTint g_iSignalExtCode = 0; // Signal Extended Code

RTvoid RTcdecl SignalHandler( PRTint iSignalCode, PRTint iSignalExtCode );

RTvoid RTcdecl SignalHandler( PRTint iSignalCode, PRTint iSignalExtCode )
{
    g_iSignalCode = iSignalCode; // Saves the Signal & Signal Extended Codes
    g_iSignalExtCode = iSignalExtCode;
    ...
    CrtLongjmp( jmpbuf, -1 ); // Restores calling environment, returns to 'CrtSetjmp' with a code -1
}

RTvoid SignalInfo( RTvoid );

RTvoid SignalInfo( RTvoid )
{
    CrtPrintf( RTU("Signal Code : %ld\nSignal Extended Code: %ld\n"),
               ( RTint )g_iSignalCode, ( RTint )g_iSignalExtCode );
    CrtPrintf( RTU("Detected Floating-Point Error: ") );

    switch( g_iSignalExtCode )
    {
        ...
        case _RTFPE_ZERODIVIDE:
            CrtPrintf( RTU("Divide by Zero\n") );
            break;
        ...
        default:
            CrtPrintf( RTU("Unknown or Unsupported\n") );
            break;
    }
}
...
RTvoid CrtMain( RTvoid )
{
    ...
    if( CrtSignal( _RTSIGFPE, ( RTvoid ( RTcdecl * )( PRTint ) )SignalHandler ) == _RTSIGERR )
    {
        ...
    }
    ...
    if( bOk == RTtrue )
    {
        jmpret = CrtSetjmp( jmpbuf );
        if( jmpret == 0 )
        {
            ...
            fR = fV1 / 0.0f; // Test-Case 3 - Enforce Divide by Zero
        }
        else
        {
            SignalInfo();
        }
    }
    ...
}
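
The example above relies on wrapper names (CrtSignal, CrtSetjmp and so on) that are not shown in full. A rough POSIX counterpart for SIGSEGV on Linux might look like the sketch below. The names `onFatalSignal`, `installSegvHandler` and `demoHandlerRuns` are hypothetical, and the demo uses `raise()` so that returning from the handler is well-defined. After a real memory fault the handler should log and exit rather than return.

```cpp
#include <cassert>
#include <csignal>
#include <cstring>
#include <signal.h>
#include <unistd.h>

// Last signal seen by the handler; sig_atomic_t is safe to write from it.
static volatile std::sig_atomic_t g_lastSignal = 0;

// Handler restricted to async-signal-safe calls: write(), no printf or malloc.
extern "C" void onFatalSignal(int signo, siginfo_t* /*info*/, void* /*context*/) {
    g_lastSignal = signo;
    static const char msg[] = "caught fatal signal\n";
    const ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
}

// Installs the handler for SIGSEGV; returns true on success.
bool installSegvHandler() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = onFatalSignal;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, nullptr) == 0;
}

// Demo only: raise() delivers SIGSEGV to this thread before it returns,
// so the handler has already run when the flag is checked.
bool demoHandlerRuns() {
    const bool installed = installSegvHandler();
    assert(installed);
    g_lastSignal = 0;
    std::raise(SIGSEGV);
    return g_lastSignal == SIGSEGV;
}
```

A handler like this only reports where the fault was detected. As this thread shows, the write that corrupted memory may have happened well before the crash.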

Len_Testa · ‎01-17-2012

The issue is resolved and it was not a compiler error. As Sergey thought, it ended up being a relatively simple case of the code stepping one past the end of an array. This, with certain compiler flags set, corrupted the length of a vector variable and caused the SIGSEGV.

The interesting thing, to me anyway, is that neither gdb, Valgrind nor the Intel debugger correctly flagged the offending lines or caught the error when it happened, when I was using the -fast option.

I ended up doing something along the lines of Brandon's advice, where I started with a base set of compiler flags and went through a cycle of adding printf statements while running gdb to catch the error.

Eventually I narrowed the error down to something like this:

1: std::cout << "The size of the vector is " << vector.size();
2: z = SomeFunction(vector, arg2, arg3, arg4);
3: // never gets to this point
4:
5: void SomeFunction(v, arg2, arg3, arg4)
6: {
7: std::cout << "The size of the vector is " << v.size();
8: // more code

Line 1 correctly printed the size of the vector. The next executed line of code is line 7 in SomeFunction. Here, though, the size of the vector was always corrupted.

From there I went back and looked at all of the variables passed to SomeFunction which were also manipulated just prior to calling SomeFunction. It was there that I found the array-bounds issue.
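
For anyone hitting something similar: if the raw array can be swapped for `std::array` or `std::vector`, bounds-checked access via `at()` turns a silent one-past-the-end write into an exception at the offending line. The following is a minimal sketch with hypothetical names (`fillWithBoundsCheck`, `values`, `neighbour`), not the actual code from my program.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the code that runs just before SomeFunction:
// it fills a 4-element array using an off-by-one bound (<= instead of <).
// Returns true if the bad index was caught before any memory was touched.
bool fillWithBoundsCheck(std::vector<int>& neighbour) {
    std::array<int, 4> values{};
    const std::size_t sizeBefore = neighbour.size();
    try {
        for (std::size_t i = 0; i <= values.size(); ++i) {  // bug: should be <
            values.at(i) = static_cast<int>(i);  // at() throws on i == 4
        }
    } catch (const std::out_of_range&) {
        assert(neighbour.size() == sizeBefore);  // the vector is untouched
        return true;
    }
    return false;
}
```

The extra check in `at()` costs some speed in a hot loop, so it may make sense to use it only while hunting the bug and switch back to `operator[]` afterwards.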

I've not checked whether idb works, but I'll do so and post results shortly.

I'd like to thank Brandon, Sergey and Georg for their help with this. I'd also like to point out that my code runs 4x faster with the Intel C++ compiler than with gcc. Intel has both great products and very smart, helpful people.

Len

SergeyKostrov · ‎01-17-2012

Quoting Len Testa

The issue is resolved and it was not a compiler error. As Sergey thought, it ended up being a relatively simple case of the code stepping one past the end of an array. This, with certain compiler flags set, corrupted the length of a vector variable and caused the SIGSEGV.

...

The interesting thing, to me anyway, is that neither gdb, Valgrind nor the Intel debugger correctly flagged the offending lines or caught the error when it happened, if I was using the -fast option.

[SergeyK] If you used Visual Studio (any version since 1994), it would detect this in a Debug
configuration and launch the JIT MS Debugger. In a Release configuration the application would
crash with an Access Violation exception. Please believe me.

Note: MS software developers created a very good memory-leak and out-of-bounds detection
API, and the core of it is the '_malloc_dbg' CRT function. You can read about it on MSDN if
interested.

JIT - Just-In-Time

...

Len

SergeyKostrov · ‎01-18-2012

Quoting Len Testa

The issue is resolved and it was not a compiler error. As Sergey thought, it ended up being a relatively simple case of the code stepping one past the end of an array. This, with certain compiler flags set, corrupted the length of a vector variable and caused the SIGSEGV.
...

I'd like to understand the following: which CRT function returned the SIGSEGV error code?

The 'signal' CRT function does the error handling. But it has to be another CRT function, or some other
piece of code, that raised SIGSEGV in your application (or in the OS?).

Did you have a chance to look at it? Thanks in advance.

Best regards,
Sergey

Georg_Z_Intel · ‎01-19-2012

Hello Len,

I'd still be interested in the reason why both IDB & Inspector XE did not work for you. It does not surprise me that the SEGV was caused by a buffer overflow, though. During my work I have had to deal with those far too often. However, our tools should assist in such cases. As they failed for you, I'd like to look into that case so it does not happen again.

Maybe, now that you know the cause, you can easily create a reproducer. Or verify that it's more of a general configuration problem and not related to the SEGV at all.

Any feedback from you would be kindly appreciated.

Thank you very much,

Georg Zitzlsberger

SergeyKostrov · ‎01-23-2012

Quoting Georg Zitzlsberger (Intel)

...It does not surprise me that the SEGV has been caused by a buffer overflow...

This is a short follow up in order to clear one little thing:

...
* SIGSEGV, SIGILL and SIGFPE all have more than one exception mapped
* to them.
...
