Possible compiler bug?

Jacob_Williams · 12-17-2015

I'm seeing some weird behavior on certain Linux systems for code that works fine on Windows (and also on other Linux systems). The example code is attached (reproducer.f90). It has a class that contains a procedure pointer, and that pointer is associated with a subroutine contained within another subroutine (an internal procedure). My understanding is that this is valid?
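For readers without the attachment, the pattern described above might look roughly like the sketch below. This is not the attached reproducer.f90: the names my_class, callback, func, add_offset and offset are assumptions made for illustration, and only my_module, my_test and test are taken from the backtrace later in this post. The point it shows is that the internal subroutine reads a variable of its host, so a pointer to it has to carry the host's stack frame along with the code address.

```fortran
module my_module
  implicit none
  private
  public :: my_class, my_test

  ! Hypothetical interface for the callback stored in the class.
  abstract interface
    subroutine callback(x, f)
      real(8), intent(in)  :: x
      real(8), intent(out) :: f
    end subroutine callback
  end interface

  ! Hypothetical class holding a procedure pointer component.
  type :: my_class
    procedure(callback), pointer, nopass :: func => null()
  end type my_class

contains

  subroutine my_test()
    type(my_class) :: c
    real(8) :: offset, f

    offset = 1.0d0
    c%func => add_offset      ! target is an internal procedure (allowed since Fortran 2008)
    call c%func(2.0d0, f)     ! the call goes through the pointer
    write(*,*) f

  contains

    subroutine add_offset(x, f)
      real(8), intent(in)  :: x
      real(8), intent(out) :: f
      f = x + offset          ! host association: uses my_test's local variable
    end subroutine add_offset

  end subroutine my_test

end module my_module

program test
  use my_module, only: my_test
  implicit none
  call my_test()
end program test
```

It can be built the same way as the real reproducer, for example with the compile line given below.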

My system is: HP DL360 G6, Intel(R) Xeon(R) X5570, CentOS 6. I'm using Intel 16.0.1 20151021.

Compile with: ifort -g -traceback reproducer.f90 -o reproducer

Running it crashes (I'm not sure why the traceback shows no line numbers; what do I have to do to get those?):

$ ./reproducer
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
reproducer         0000000000477705  Unknown               Unknown  Unknown
reproducer         00000000004754C7  Unknown               Unknown  Unknown
reproducer         0000000000444DB4  Unknown               Unknown  Unknown
reproducer         0000000000444BC6  Unknown               Unknown  Unknown
reproducer         0000000000425CC6  Unknown               Unknown  Unknown
reproducer         00000000004032B0  Unknown               Unknown  Unknown
libpthread.so.0    0000003A0460F790  Unknown               Unknown  Unknown
Unknown            00007FFE9F1AD658  Unknown               Unknown  Unknown

However, running it in a debugger, it works fine:

$ gdb-ia ./reproducer

(gdb) run
Starting program: reproducer
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
   3.00000000000000
[Inferior 1 (process 14138) exited normally]
(gdb)

Also interesting is that if I use "set disable-randomization off" in the debugger, it crashes again:

$ gdb-ia ./reproducer

(gdb) set disable-randomization off
(gdb) run
Starting program: reproducer
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Catchpoint -2 (signal SIGSEGV), 0x00007fff187ab458 in ?? ()

(gdb) where
#0  0x00007fff187ab458 in ?? ()
#1  0x00000000004030d5 in my_module::my_test () at reproducer.f90:42
#2  0x0000000000403175 in test () at reproducer.f90:71
#3  0x0000000000402e1e in main ()
#4  0x0000003a03a1ed5d in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000402d29 in _start ()
(gdb)

Any ideas? I think the code is valid, so maybe it's a compiler bug? However, since it does work on other similar Linux systems, I'm wondering if it could be some system-specific setting, but I don't know what would cause such behavior. (I normally work with the Windows compiler, so maybe more Linux-savvy folks can point me in the right direction.)

Jacob_Williams · 12-21-2015

Maybe it's some sort of "security" feature on the system? I'll check with my IT guy and report back if we figure anything out.

Steven_L_Intel1 · 12-22-2015

That's what I'm thinking. Since you can take the executable from a system where it works and run it on another system where it fails, that tells me it isn't the compiler.

Rick_R_ · 12-23-2015

It is exec-shield causing this.

This was definitely due to us having the sysctl flag kernel.exec-shield=3. With this setting the kernel enforces non-executable memory using the processor's XD (Execute Disable) bit, generically called the NX (no-execute) bit.

We initially ruled this out as the cause because our old systems with Xeon 5300 processors had it set the same way; however, I do not think Linux recognizes this feature on that processor. So the problem only showed up on the newer Xeon 5400 and 5500 processors.

We will set kernel.exec-shield=1 as a workaround. I suggest your compiler team get together with the people on the processor team who added the XD bit and come up with a solution that does not make your compiled programs behave the way malicious code would.

Happy Holidays

Steven_L_Intel1 · 12-23-2015

I thought it was something like that. gcc has a similar issue with what it calls "trampolines". Interestingly, I had thought we did have a method that avoided the issue of executing stack code, but it seems not.

Thanks for getting back to us with the resolution. I can see that this is a potential issue going forward, and I will see what I can do to raise its visibility. Maybe we can come up with something else (though it will probably be slower).
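Until the compiler offers something else, one user-side way to sidestep the issue is to avoid pointing at an internal procedure at all. The sketch below is only an illustration under assumptions, not the attached reproducer and not anything Intel proposed in this thread: the names my_class, callback, func, add_offset and offset are hypothetical. The data the callback needs is moved into the class, and the pointer target becomes an ordinary module procedure, so the pointer is a plain code address and no code has to be executed from the stack.

```fortran
module my_module
  implicit none
  private
  public :: my_class, my_test

  ! Hypothetical class: the data the callback needs is stored here
  ! instead of in a local variable of the calling subroutine.
  type :: my_class
    real(8) :: offset = 0.0d0
    procedure(callback), pointer, pass(me) :: func => null()
  end type my_class

  abstract interface
    subroutine callback(me, x, f)
      import :: my_class
      class(my_class), intent(in) :: me
      real(8), intent(in)  :: x
      real(8), intent(out) :: f
    end subroutine callback
  end interface

contains

  subroutine my_test()
    type(my_class) :: c
    real(8) :: f

    c%offset = 1.0d0
    c%func => add_offset      ! target is a module procedure, not an internal one
    call c%func(2.0d0, f)     ! c is passed implicitly as the first argument
    write(*,*) f
  end subroutine my_test

  ! Module procedure: everything it needs arrives through its arguments.
  subroutine add_offset(me, x, f)
    class(my_class), intent(in) :: me
    real(8), intent(in)  :: x
    real(8), intent(out) :: f
    f = x + me%offset
  end subroutine add_offset

end module my_module

program test
  use my_module, only: my_test
  implicit none
  call my_test()
end program test
```

The trade-off is that the callback can no longer reach the caller's local variables through host association, so any such state has to be carried explicitly in the object or passed as an argument.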