Solved: Catastrophic error in compiler appears from version 088 onwards

fercook · ‎10-18-2010

I am working on a rather large project with many classes and subclasses. Everything works great (I have a test suite) and compiles fine under version 11.1.084. However, if I switch to version 088 or 089 I get an error when compiling some of the class modules:

: catastrophic error: **Internal compiler error: segmentation violation signal raised** Please report this error along with the circumstances in which it occurred in a Software Problem Report. Note: File and line given may not be explicit cause of this error.

compilation aborted for MPSTensor_Class.f90 (code 3)

make: *** [MPSTensor_Class.o] Error 3

No further messages are given. I could not find anything wrong in the code (as I said, it compiles and runs fine in version 084). Since the project is kind of large, I could not identify which operation triggers the error. I would have loved to post a reproducer, but I already spent a week tracing this bug and must get back to developing... Would you like me to send you the whole project? Do you know of any related threads that I missed? (There was one about a year ago, but it seemed fixed.)

mecej4 · ‎10-18-2010

It may be useful to see the source code in MPSTensor_Class.f90 (and of any modules USEd in it).

jimdempseyatthecove · ‎10-18-2010

Does the compilation fail in Debug build?
In particular, does it fail with optimizations disabled?

Additional area to look at

Both 32-bit and 64-bit compiles may experience a segmentation fault when the static data of any linker segment (code, data, ...) exceeds 2GB. Would this file (MPSTensor_Class.f90) create such a segment?

Jim Dempsey

fercook · ‎10-18-2010

Hi Jim,

yes, it fails with all optimizations disabled.

How can I know the size of the linker segments? Why would it change from version 084 to 088 and 089?

Thanks,

Fernando

fercook · ‎10-18-2010

Well, it's quite a bit of code, but I understand little can be done without inspecting it.

So, here it is, the MPSTensor_Class and all modules used by it. I am also including the Test Units, which use NASA's FUnit framework. The compilation order is

constants.f90 error.f90 Tensor_Class.f90 Operator_Class.f90 MPSTensor.helper.f90 MPSTensor_Class.f90 Tensor_Class_fun.f90

Also notice that LAPACK is required to compile Tensor_Class, either through the MKL or with the vecLib framework in Mac OS X. The helper file is an artifact that I need to include for FUnit, not a real part of my program. FUnit is also the reason why all the modules are public instead of private, but I do need to test.

The important module here is Tensor_Class.f90 (which now boasts 3K lines and is first in queue to be refactored into smaller classes/files), MPSTensor only uses a small set of things from Tensor.

While I was preparing this zip, I realized that Tensor_Class compiles OK, but the compiler crashes when it attempts to compile the associated Test Unit (which I am including here; the files are called Tensor_Class_fun.f90 and TestRunner.f90). If the MPSTensor files are not compiled, this can be tested with the attached files. It seems to be the same error as with MPSTensor (a module trying to use the Tensor_Class module).

Again, notice that ALL compiles and runs OK with version 084.

Thanks,

Fernando

Kevin_D_Intel · ‎10-18-2010

It is almost certainly a regression in the 11.1.088 compiler. Thanks for the attached file. I will investigate and post an update soon.

Kevin_D_Intel · ‎10-18-2010

Was the earlier attached file removed?

I'm getting "Page not Found". If you're concerned about general access then you can post a private reply with the source attached or open an Intel Premier (https://premier.intel.com/) issue.

fercook · ‎10-18-2010

No, I did not remove it, but I had a browser crash while posting the article. Everything looked good afterwards, but now I cannot see the file either. I am uploading it as an attachment here, we'll see if it works.

Update: No, it still does not work, and this time the browser is fine. For some reason I can upload and attach files, but then they are not found. I will try in a private reply...

jimdempseyatthecove · ‎10-18-2010

>>yes, it fails with all optimizations disabled.

This shows that it is not an issue with optimizations, in particular Inter-Procedural Optimizations.

>>How can I know the size of the linker segments?
Generate a linker load map (linker option)

>>Why would it change from version 084 to 088 and 089?
Unknown, but as a guess: if 084 packed data to 4/8-byte boundaries (natural pointer boundary) but 088 onwards packs data to 16-byte boundaries (SSE friendly) or 64-byte boundaries (cache friendly), then the size of the data segment would grow.

Looking at the linker map from 088 could tell you if any of the segment sizes are nearing a limit. Note that the segment sizes in the linker map are reported in hexadecimal. A little bit of hex addition may be required to obtain the full size of the data segment, e.g. take the offset of the last data object loaded and add its size (in hex). MS Calculator, in scientific view, has a Hex mode. 2GB == 0x80000000

In the map you will see a table with headers

 Start         Length     Name       Class
 0001:00000000 00123456H  .text      CODE
 ...
 0002:00000930 00000109H  .CRT$XIZ   DATA
 ...

The first 4 digits are the linker segment number. Segment is a leftover nomenclature from the earlier non-flat model programming context (Real Mode Segments, 80286 protected mode selectors). To maintain some transitional compatibility, the load offset, the number following the :, and the Length, 3rd number field, kept a restriction of 32-bits. Depending on command line options this may be signed or unsigned. When signed you have a 2GB limitation on the size of any one of these segments.

As you progress through the different segments, you will see that each begins with a segment offset of 00000000. This is the offset relative to the load point of the segment (not the load point of the program in general).

The image loader may or may not have this same limitation. To get the total image size, look in the next section. Header has

Address Publics by Value Rva+Base Lib:Object

The last item in segment 0000 might read "__ImageBase", with an Rva+Base value. On a 64-bit build this might read 0000000140000000.

This is the virtual address of the first thing in your load image.
Scroll down to the end of the load map and locate the last Rva+Base value. This tells you the virtual address of the last object loaded, not its size. As an approximation of the image size, take this address and subtract the Rva+Base of "__ImageBase".
Or in this case, simply see whether the last item's Rva+Base exceeds 1BFFFFFFF (140000000 + 7FFFFFFF, i.e. 2GB above the image base).
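
As a worked example of the hex arithmetic described above, here is a minimal Fortran sketch. The segment offset and length come from one entry of the sample map lines, and the __ImageBase value is the 64-bit example mentioned above. The "last Rva+Base" value is a made-up placeholder, not taken from any real map, and all variable names are only illustrative.

```fortran
program map_arithmetic
  implicit none

  ! 64-bit integer kind, so that 64-bit virtual addresses fit
  integer, parameter :: i8 = selected_int_kind(18)

  ! 2GB limit that applies when the 32-bit offset/length fields are signed
  integer(i8), parameter :: two_gb = int(Z'80000000', i8)

  integer(i8) :: obj_offset, obj_length, obj_end
  integer(i8) :: image_base, last_rva_base, image_span

  ! One entry of the sample map: 0002:00000930 00000109H
  ! (for the full segment size, use the LAST entry of the segment instead)
  obj_offset = int(Z'00000930', i8)
  obj_length = int(Z'00000109', i8)
  obj_end    = obj_offset + obj_length        ! 930h + 109h = A39h

  print '(a,z0,a)', 'object ends at offset ', obj_end, 'h'
  print '(a,l1)',   'offset reaches 2GB: ', obj_end >= two_gb

  ! Rva+Base of __ImageBase (64-bit example from the text) and of the last
  ! public symbol in the map. The second value is HYPOTHETICAL.
  image_base    = int(Z'0000000140000000', i8)
  last_rva_base = int(Z'0000000140123456', i8)
  image_span    = last_rva_base - image_base

  print '(a,z0,a)', 'approximate image span ', image_span, 'h'
  print '(a,l1)',   'image span reaches 2GB: ', image_span >= two_gb
end program map_arithmetic
```

The span computed this way is only a lower bound, since it does not include the size of the last object itself.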

Jim Dempsey

Kevin_D_Intel · ‎10-18-2010

Thanks for the file Fernando. The internal error is reproducible with the current 11.1.088 (and later) as reported and also with our upcoming major release next month.

I reported this to Development (see internal tracking id below) and will update again with any available work-around. It doesn't appear any compiler option is able to avoid the error so it is probably easiest to stick with 11.1.084 for now.

(Internal tracking id: DPD200161823)

(Resolution Update on 04/27/2011): This defect is fixed in the Intel Fortran Composer XE 2011 Update 3 (2011.3.167 - Mac OS)

Here's a pruned reproducer from Tensor_Class_fun.f90. Line 13 triggers the error. There may be a source-code work-around.

[fortran]module Tensor_Class_fun
 use Tensor_Class

implicit none
private

 contains

 subroutine tensor4_joinindices
  type(tensor2) :: aMatrix,correct
  integer error,i,j,k,l

  error=correct%Delete()
 end subroutine tensor4_joinindices
end module Tensor_Class_fun[/fortran]

fercook · ‎10-19-2010

Thanks for the tip. I had suspected the delete() line from a similar program, but that one turned out to be a simple bug of things not being properly deallocated.

What line 13 does is call my version of the final routine (by the way, is that supported in new versions?). The derived class contains allocatable components, and the delete() routine takes care of deallocating all of them before killing the object.

I had tried to make a reproducer that did only this: create an object with allocatable components, allocate all of them, use them for something silly, and then deallocate them through the delete() function. It always worked fine, so I think I am missing some ingredient...
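
For readers following along, the kind of test described above might look roughly like the sketch below. This is not the actual project code: the names tensor2, Delete, correct and error come from the pruned reproducer, while the component values and the procedure delete_tensor2 are hypothetical.

```fortran
module delete_sketch
  implicit none

  type :: tensor2
     real, allocatable :: values(:,:)      ! hypothetical component
   contains
     procedure :: Delete => delete_tensor2
  end type tensor2

contains

  ! Deallocate every allocatable component; returns 0 on success,
  ! otherwise the stat value reported by deallocate.
  function delete_tensor2(this) result(error)
    class(tensor2), intent(inout) :: this
    integer :: error
    error = 0
    if (allocated(this%values)) deallocate(this%values, stat=error)
  end function delete_tensor2

end module delete_sketch

program try_delete
  use delete_sketch
  implicit none
  type(tensor2) :: correct
  integer :: error

  allocate(correct%values(2,2))
  correct%values = 1.0                     ! "something silly"
  print *, 'sum before delete:', sum(correct%values)

  error = correct%Delete()                 ! same call shape as line 13
  print *, 'error code:', error
  print *, 'still allocated:', allocated(correct%values)
end program try_delete
```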

In the meantime, I keep using version 084.

On another note: "catastrophic error" ? Were the "absolute doom" or "disastrous calamity" messages taken for something else? ;)

Kevin_D_Intel · ‎10-20-2010

Yes, FINAL will be supported in the next major release coming next month.

fercook · ‎10-25-2010

Yeah, FINALly ! :)

On another note, is there a way to find out if moving the stuff I keep on the delete() routines to FINAL would remove the bug?

Kevin_D_Intel · ‎10-25-2010

Sure. Send me the changes and I will test them. I took a stab at it but I'm not getting everything right and it would be more expeditious to have you send the modified code.

fercook · ‎01-12-2011

Ok, I didn't give up on this -- it was just way too hard to find time to do it. The problem persists with the latest version of the compiler, XE 12.0.1.122. What's more, and what more or less forced me to come back to this, is that even with the previous version of the compiler (11.1.084) the error shows up if the debug (-g) option is used. This old version compiles OK without debugging active.

The good news is that I think I have more or less pinpointed the source of the catastrophic error, and even found a source-code workaround for it (I want brownie points for this! The reproducer was painstakingly obtained by boiling down a >3K-line file).

The attached files have the following:

One module describes two classes (Tensor2 and Tensor3) that are derived from another class (Tensor). Tensor provides some basic interface (in this example boiled down to only the dimensions of the tensor), and each class has further type-bound procedures (in this reproducer, only Tensor3 has a type-bound function).

The driver module (which causes the catastrophic error) has only one function that simply creates a Tensor3 and then tries to obtain the dimensions.

If the type-bound procedure GetDimensions() is not called, no error shows up. If Tensor3 has no type-bound procedures, no error occurs either. If instead of a polymorphic function I bind a dimension procedure to each class, no error occurs (this is my best code work-around yet).

Notice that GetDimensions returns an allocatable array; this could be the source of the error, but I haven't tested that. For now, I will stick to not using polymorphic functions, which seems to be more or less working for me.
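
Since the attached module.f90 and driver.f90 are not reproduced in the thread, the following is only a hypothetical sketch of the structure described above, not the actual reproducer. The names Tensor, Tensor2, Tensor3 and GetDimensions come from the description; everything else (the values components, Tensor3Size, make_and_measure, the module names) is invented for illustration, and there is no claim that this exact code triggers the internal error.

```fortran
module module_sketch
  implicit none

  ! Base class: provides only the dimensions interface
  type :: Tensor
     logical :: initialized = .false.
   contains
     procedure :: GetDimensions
  end type Tensor

  type, extends(Tensor) :: Tensor2
     real, allocatable :: values(:,:)
  end type Tensor2

  ! Only Tensor3 has a type-bound function of its own
  type, extends(Tensor) :: Tensor3
     real, allocatable :: values(:,:,:)
   contains
     procedure :: NumElements => Tensor3Size
  end type Tensor3

contains

  ! Polymorphic function returning an allocatable array
  function GetDimensions(this) result(dims)
    class(Tensor), intent(in) :: this
    integer, allocatable :: dims(:)
    select type (this)
    type is (Tensor2)
       allocate(dims(2))
       dims = shape(this%values)
    type is (Tensor3)
       allocate(dims(3))
       dims = shape(this%values)
    class default
       allocate(dims(0))
    end select
  end function GetDimensions

  function Tensor3Size(this) result(n)
    class(Tensor3), intent(in) :: this
    integer :: n
    n = size(this%values)
  end function Tensor3Size

end module module_sketch

module driver_sketch
  use module_sketch
  implicit none
contains
  ! Creates a Tensor3 and asks for its dimensions through the base class
  function make_and_measure() result(rank)
    integer :: rank
    type(Tensor3) :: t
    integer, allocatable :: dims(:)
    allocate(t%values(2,3,4))
    t%values = 0.0
    allocate(dims(3))
    dims = t%GetDimensions()
    rank = size(dims)
  end function make_and_measure
end module driver_sketch

program run_sketch
  use driver_sketch
  implicit none
  print *, 'rank reported:', make_and_measure()
end program run_sketch
```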

Here is the output from the console:

[bash]$ ifort -V
Intel Fortran Intel 64 Compiler XE for applications running on Intel 64, Version 12.0.1.122 Build 20101110
Copyright (C) 1985-2010 Intel Corporation.  All rights reserved.

$ ifort -c module.f90 
$ ifort -c driver.f90 
driver.f90: catastrophic error: **Internal compiler error: segmentation violation signal raised** Please report this error along with the circumstances in which it occurred in a Software Problem Report.  Note: File and line given may not be explicit cause of this error.
compilation aborted for driver.f90 (code 1)
[/bash]

And module.f90 and driver.f90 are attached.

Can you please get back to me on this, especially if you have some info on when it will be fixed?

Kevin_D_Intel · ‎01-12-2011

Thank you for the efforts in producing the smaller reproducer. I'm certain Development will appreciate it.

I forwarded your new details and test case to them and requested an update. I will update again when I hear from them.

fercook · ‎01-13-2011

Well, I have further bad news: my source code work around did indeed work for the module I was working on, i.e. it compiles and runs with the latest version of ifort, and also with -debug options activated. However, when I tried to compile another module that uses the module above, it gives the catastrophic error again...even if this module does not have select type kind of functions. This new module extends one of the types in the original module (imagine a new module that extends the types in the reproducer).

The (kind of) good news is that I have a new workaround, which seems to be working so far: I basically scrapped all the polymorphism from the code, and made each class (Tensor2 and Tensor3 in the reproducer above) an independent and unique class -- that is, I am duplicating code like crazy, with all the interface of the virtual Tensor class now copy/pasted in each derived class and implemented with specific code, without ever using select type or anything like that.
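
A rough sketch of what this non-polymorphic layout might look like is given below. It is an assumption based on the description only: the values components, the procedure names and the module name are hypothetical. Each type stands alone, carries its own copy of the dimensions code, and no select type appears anywhere.

```fortran
module tensor_flat_sketch
  implicit none

  ! No common parent type: each class is independent
  type :: Tensor2
     real, allocatable :: values(:,:)
   contains
     procedure :: GetDimensions => Tensor2Dimensions
  end type Tensor2

  type :: Tensor3
     real, allocatable :: values(:,:,:)
   contains
     procedure :: GetDimensions => Tensor3Dimensions
  end type Tensor3

contains

  ! The passed-object dummy of a type-bound procedure must still be
  ! declared with class(...), but there is no inheritance or dispatch.
  function Tensor2Dimensions(this) result(dims)
    class(Tensor2), intent(in) :: this
    integer :: dims(2)
    dims = shape(this%values)
  end function Tensor2Dimensions

  ! Duplicated code, specific to rank 3
  function Tensor3Dimensions(this) result(dims)
    class(Tensor3), intent(in) :: this
    integer :: dims(3)
    dims = shape(this%values)
  end function Tensor3Dimensions

end module tensor_flat_sketch

program run_flat_sketch
  use tensor_flat_sketch
  implicit none
  type(Tensor2) :: a
  type(Tensor3) :: b

  allocate(a%values(2,3))
  allocate(b%values(2,3,4))
  a%values = 0.0
  b%values = 0.0
  print *, 'Tensor2 dimensions:', a%GetDimensions()
  print *, 'Tensor3 dimensions:', b%GetDimensions()
end program run_flat_sketch
```

The cost of this layout is exactly the duplication described above: every routine of the former base class has to be repeated once per type.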

At this point, the very useful concept of inheritance in Fortran has been brought down to its knees...I will keep my old polymorphic code in a separate branch for when this gets fixed, because the code that runs is UGLY :)

Hopefully, this issue is kind of big and gets fixed soon.

Thanks for your help and best regards,

Fernando

Kevin_D_Intel · ‎01-14-2011

The Developer confirmed that a fix he completed yesterday corrects the internal error in your case and in other similar reports. Both your original and latest reduced test cases compile successfully. More internal testing is needed, and the cutoff for the coming Fortran Composer XE 2011 Update 2 has already passed, so assuming the fix checks out OK, it should appear in Update 3 (tentatively in the March time-frame). Unfortunately there is no work-around.

I will update again when I know more.

Kevin_D_Intel · ‎04-27-2011

Pardon the delayed update. This defect is fixed in the Intel Fortran Composer XE 2011 Update 3 (2011.3.167 - Mac OS)