Version and optimization dependent segfault

Van_Veen__Lennaert · ‎10-21-2014

I have a segfault I would appreciate some help with. A nearly minimal code that reproduces it is attached.
The background is that I am developing a code that handles big matrices, which should be distributed over CPUs along one index (labeled z in the example). I want to determine the distribution at run time, based on the number of processes returned by an MPI routine. I have set this up with a module "global", used by all other modules, that holds some auxiliary variables related to the partitioning. In the main program I then obtain the number of processes and set up these variables (in the example code only ny and nz, integers that appear in loop bounds, and kz, an allocatable array). Note that I have removed all MPI-related code from the example, setting nprocs and myrank by a simple assignment.
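For readers without the attachment, the setup described above might look roughly like the sketch below. Only the names global, nprocs, myrank, ny, nz, kx, ky and kz come from the thread; the grid size n = 16, the real(kind=8) type of the wavenumber arrays, the routine name init_partition and the way kz is filled are assumptions made for illustration.

```fortran
! Sketch only: the attached source is not reproduced in the thread, so this
! layout is an assumption based on the description in the post.
module global
  implicit none
  integer, parameter :: n = 16           ! assumed grid size
  integer :: nprocs, myrank              ! known only at run time
  integer :: ny, nz                      ! local loop bounds
  real(kind=8) :: kx(0:n/2), ky(0:n-1)   ! statically sized (type assumed)
  real(kind=8), allocatable :: kz(:)     ! sized once nprocs is known
contains
  ! Hypothetical helper: sets the partitioning variables for this process.
  subroutine init_partition(np, rank)
    integer, intent(in) :: np, rank
    integer :: k
    nprocs = np
    myrank = rank
    ny = n
    nz = n / nprocs                      ! assumes nprocs divides n evenly
    if (allocated(kz)) deallocate(kz)
    allocate(kz(0:nz-1))
    do k = 0, nz-1
      kz(k) = real(myrank*nz + k, kind=8)  ! placeholder values
    end do
  end subroutine init_partition
end module global

program demo
  use global
  implicit none
  ! Stand-in for the MPI calls that would return the process count and rank.
  call init_partition(1, 0)
  print *, 'nz =', nz, ' size(kz) =', size(kz)
end program demo
```

An allocatable module variable that is allocated once before any routine reads it is standard Fortran, so a pattern like this should not by itself cause a segfault.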

When I compile the attached code on our small cluster, running Linux version 2.6.18-164.11.1.el5 (Red Hat 4.1.2-46) and ifort version 11.1, I find that
* with optimization -O1 and -O2 the code runs and terminates cleanly;
* with optimization -O3 I get:
> ifort -O3 -traceback -o test.x DNS_int.f90
> ./test.x

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
test.x             0000000000403062 hit3d_mp_rhs3_             44 hit3d.f90
test.x             0000000000402EC0 hit3d_mp_rhs_              22 hit3d.f90
test.x             0000000000402C53 MAIN__                     33 DNS_int.f90
test.x             0000000000402ACC Unknown               Unknown Unknown
libc.so.6          0000003A9B01D994 Unknown               Unknown Unknown
test.x             00000000004029D9 Unknown               Unknown Unknown

It would seem that the root cause is the way that kz is handled. If I declare it just like kx and ky, rather than dynamically, the segfault disappears. That would not be a solution, though, as I need to allocate it dynamically.
My questions:
1) Is the construction I use correct? If not, please suggest a correct way to do this (to allocate kz based on a value of nprocs determined during runtime).
2) If it is correct, then is this a compiler bug? Is there a work-around that keeps my code portable and the executable near-optimal?

Two more observations that may be relevant:
Adding certain compiler flags makes the segfault go away. For instance, combining -O3 with any one of -check pointers, -check bounds, -check uninit or -no-vec makes it disappear.
When I compile the code on my laptop, running Linux version 3.11.0-26-generic (Ubuntu 13.10) with ifort 12.1.0, there is no segfault at all at any optimization level.

Any help with this would be greatly appreciated.

pbkenned1 · ‎10-22-2014

I haven't studied the code, but possibly this is a bug in the 11.1 compiler. The 15.0 compiler has no issue:

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.090 Build 20140723
Copyright (C) 1985-2014 Intel Corporation. All rights reserved.

$ ifort -O3 -traceback -o test.x DNS_int.f90
$ ./test.x
$

Patrick

Van_Veen__Lennaert · ‎10-22-2014

It would be great if a specialist could confirm that this is an ifort bug. I just want to make sure it is not a mistake in my code.
Also, since I cannot change the version of ifort on the cluster, a safe work-around for version 11.1 would be very helpful.
Thanks for the check.

pbkenned1 · ‎10-22-2014

What exact version of ifort 11.1 are you using (i.e., the output of ifort -V)? I did try the last 11.1 version, and your example worked normally. I'd be happy to determine whether this is really an ifort bug, but I need to be able to reproduce the SEGV.

Patrick

jimdempseyatthecove · ‎10-22-2014

Not that this matters with nprocs=1 but...

complex(kind=8), dimension(0:n/2,0:n-1,0:n-1) :: A,B,FA,FB
...
subroutine RHS(A,B,RA,RB)
complex(kind=8), intent(in), dimension(0:n/2,0:n-1,0:nz-1) :: A,B
complex(kind=8), intent(out), dimension(0:n/2,0:n-1,0:nz-1) :: RA,RB

Jim Dempsey

pbkenned1 · ‎10-22-2014

I don't spot any coding errors. I think this is just an -O3 optimization bug in 11.1, since it works at -O2 with that version, or at -O3 with any other major compiler version I tested (11.1.080, 12.1.7.367, 13.1.3.192, 14.0.4.211, 15.0.0.090).

Patrick

Van_Veen__Lennaert · ‎10-22-2014

The output of ifort -V:

Intel(R) Fortran Intel(R) 64 Compiler Professional for applications running on Intel(R) 64, Version 11.1 Build 20091130 Package ID: l_cprof_p_11.1.064
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.

And /proc/version reads:

Linux version 2.6.18-164.11.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Wed Jan 20 07:32:21 EST 2010

As for Jim Dempsey's comment: in the actual program these arrays do not occur in the main program, but I cut out several layers to narrow down the possible causes. The segfault remains if I use n instead of nz to set the dimensions in subroutines RHS and RHS3. I suppose that means that kz is the root cause, not the array bounds. Thanks!

pbkenned1 · ‎10-23-2014

It's an -O3 unroll/jam defect in ifort-11.1.064. You can work around it with -unroll0:

[U533981]$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler Professional for applications running on Intel(R) 64, Version 11.1 Build 20091130 Package ID: l_cprof_p_11.1.064
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.

[U533981]$ ifort -O3 -traceback -o test.x DNS_int.f90 -unroll0
[U533981]$ ./test.x
[U533981]$

Patrick

Van_Veen__Lennaert · ‎10-23-2014

Thank you very much for sorting this out! I can move on with the project now, and I do not think disabling unrolling will have a significant impact on the run time.

pbkenned1 · ‎10-24-2014

Thanks for the feedback, I'll consider this case closed then. I'll note in closing that -unroll0 only needs to be applied to hit3d.f90. You had included that file in DNS_int.f90; I commented out the include and compiled hit3d.f90 separately to debug the issue. The unroll issue arises in the code generated for RA(kx_,ky_,kz_)=RA(kx_,ky_,kz_)-kz(kz_)*kx(kx_)*UU(kx_,ky_,kz_). As long as that statement is not a hotspot in your real application, the performance hit from applying -unroll0 probably won't be noticed.
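If restricting the workaround to a single loop nest is preferable to a file-wide flag, Intel Fortran also provides loop-level directives (!DIR$ NOUNROLL and !DIR$ NOUNROLL_AND_JAM). The sketch below applies them to a stand-in for the statement quoted above. Whether this avoids the 11.1.064 defect was not tested in this thread, so treat it as an untested alternative to -unroll0. The array sizes, types and initial values are assumptions; other compilers treat the directive lines as ordinary comments.

```fortran
! Sketch only: loop-level alternative to -unroll0, NOT verified against
! ifort 11.1.064. Sizes, types and initial values are assumptions.
program nounroll_sketch
  implicit none
  integer, parameter :: n = 8            ! assumed grid size
  integer :: nz, kx_, ky_, kz_
  real(kind=8) :: kx(0:n/2)
  real(kind=8), allocatable :: kz(:)
  complex(kind=8), allocatable :: RA(:,:,:), UU(:,:,:)

  nz = n                                 ! single-process case
  allocate(kz(0:nz-1))
  allocate(RA(0:n/2,0:n-1,0:nz-1), UU(0:n/2,0:n-1,0:nz-1))
  kx = 1.0d0
  kz = 1.0d0
  RA = (0.0d0, 0.0d0)
  UU = (1.0d0, 0.0d0)

  ! Ask the compiler not to unroll-and-jam the outer loops ...
  !DIR$ NOUNROLL_AND_JAM
  do kz_ = 0, nz-1
    !DIR$ NOUNROLL_AND_JAM
    do ky_ = 0, n-1
      ! ... and not to unroll the innermost loop.
      !DIR$ NOUNROLL
      do kx_ = 0, n/2
        RA(kx_,ky_,kz_) = RA(kx_,ky_,kz_) - kz(kz_)*kx(kx_)*UU(kx_,ky_,kz_)
      end do
    end do
  end do

  print *, RA(0,0,0)
end program nounroll_sketch
```

The flag-based workaround confirmed above remains the safer choice; the directives only narrow the scope if they turn out to have the same effect.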

Patrick