Possible compiler bug 2018 version (linux)

Danny_L_ · ‎01-09-2018

Dear all, we recently moved to version 2018 of the Fortran compiler. We found one instance where we were not able to figure out why we encounter a problem with the 2018 version. The code this happens in is very big, so I went through a procedure to strip this code down to the bare essentials. I post it here to get some feedback whether you think there may be another problem. If not I will submit a bug report. I have attached a tar file with the source files (a few small ones), a Makefile and a script (adapt.sh) that compiles the code and then runs it 4 times.

The correct result obtained with the 2017u4 version is attached (output_2017u4.txt). The output of the 2018 version is in output_2018u1. The code not only segfaults every single time, it also gives inconsistent results and it won't give a proper traceback.

For the curious: this sample code solves a linear system with a BiCGSTAB Krylov solver that uses reverse communication.

Thanks for all your advice in advance.

Danny

mecej4 · ‎01-09-2018

Line 83 of KKT.f90 refers to a dangling pointer. At that location, v and w are undefined and unassociated. When a program contains undefined variables, run time behavior is also "undefined".

Danny_L_ · ‎01-09-2018

Thanks mecej4, but v and w are set in the call to solver_revcom so at line 83 they are perfectly well defined and associated.

Danny_L_ · ‎01-10-2018

To add something else: I just tested gfortran on my Mac, and it gives the same result as the 2017 Intel compiler. This may strengthen my suspicion that something is fishy in the 2018 version. I realise the code is bigger than ideal for a quick look, but I think the call to solver_revcom is probably essential in triggering the bug.

Best, Danny

Johannes_Rieke · ‎01-10-2018

Hi Danny, just being curious I tested your code on Windows OS with 16.0.3, 17.0.4, 18.0.0 and 18.0.1 as x64 debug builds as in your makefile.

All compiler versions generate a floating point overflow:

forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source
adapt_code.exe     000000013F052CF1  BICGSTAB_mp_BICGS          82  bicgstab.f90
adapt_code.exe     000000013F05A9C3  SOLVER_mp_SOLVER_          46  solver.f90
adapt_code.exe     000000013F05BBFC  KKT_mp_SOLVE_KKT           51  kkt.f90
adapt_code.exe     000000013F05E2D4  MAIN__                     17  main.f90
adapt_code.exe     000000013F0FB952  Unknown               Unknown  Unknown
adapt_code.exe     000000013F0FC5B4  Unknown               Unknown  Unknown
adapt_code.exe     000000013F0FC4C7  Unknown               Unknown  Unknown
adapt_code.exe     000000013F0FC38E  Unknown               Unknown  Unknown
adapt_code.exe     000000013F0FC5C9  Unknown               Unknown  Unknown
kernel32.dll       0000000076B859CD  Unknown               Unknown  Unknown
ntdll.dll          0000000076DBA561  Unknown               Unknown  Unknown

rnorm = sqrt(dot_product(bicgstab%r,bicgstab%r)) ! bicgstab%r(1:) = 1.9983972E+18

Adding -real_size:64 to promote single precision to double precision avoids the overflow error. 18.0.1 then breaks without generating traceback information, while 16.0.3 and 17.0.4 run fine. Furthermore, when stepping line by line in the debugger, the 18 compiler family does not let me step through the lines in init_bicgstab (lines 183...194 in file bicgstab). Whatever this means.
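To illustrate why the single-precision norm can overflow, here is a minimal sketch. The vector length n = 100 is an assumption made for illustration only (the real length of bicgstab%r is not stated here); only the element value is taken from the debugger output above.

```fortran
program overflow_sketch
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  integer, parameter :: n = 100        ! hypothetical vector length, not from the real code
  real(real32) :: r(n)
  real(real64) :: rnorm64

  r = 1.9983972e18_real32              ! element value reported in the debugger

  print *, 'largest single-precision value:', huge(1.0_real32)

  ! Each square is about 4.0e36, so 100 terms sum to about 4.0e38,
  ! which is larger than huge(1.0_real32) (about 3.4e38).
  ! Depending on floating-point trapping options this either traps
  ! or yields an infinite result.
  print *, 'single-precision norm:', sqrt(dot_product(r, r))

  ! Converting to double precision before the product keeps the sum in range.
  rnorm64 = sqrt(dot_product(real(r, real64), real(r, real64)))
  print *, 'double-precision norm:', rnorm64
end program overflow_sketch
```

Promoting the default real size with a compiler option has a similar effect on the whole program, without touching the source.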

Hopefully not another regression of PSXE18! Better you file this issue at OSC.

Regards, Johannes

Johannes_Rieke · ‎01-10-2018

Hi, for completeness here are the terminal outputs for the first run:

PSXE 17 update 4

    KKT solver step   0 error     1.00000E+00 conv_rate     1.00000E+00
    KKT solver step   1 error     6.58145E+00 conv_rate     6.58145E+00
[solve_kkt] group iteration did not converge within      1 steps

PSXE 18 update 1

    KKT solver step   0 error     1.45682+144 conv_rate     1.00000E+00

The initial error is somehow corrupted in 18 on Windows. After this the program breaks with a debug window error. Hopefully this helps somehow.

Danny_L_ · ‎01-12-2018

Hello Johannes, Thanks for your comments. I missed that you commented so did not reply earlier. In fact we have now been able to solve this issue. We had to change the pointer declarations of v and w in solver_revcom and in bicgstab_revcom to intent(inout) instead of intent(out). Not sure if I understand this, but apparently pointer cleanup is different in 2018 (and possibly flawed?). Anyway we can get back to work :-)

Best regards, Danny

Johannes_Rieke · ‎01-12-2018

Hi Danny, good to hear that you have solved your issue. Sometimes newer compilers are more restrictive. Maybe the standard leaves room to handle pointers in different ways and the 18 series chose a different approach. But that is just speculation.

I vaguely remember a discussion on the use of intent with pointers on comp.lang.fortran:

https://groups.google.com/forum/#!topic/comp.lang.fortran/zWrmsDgJaLc

Maybe you will find useful information there. The original poster could also solve his issue by changing the intent from out to inout. Steve Lionel mentioned that this is a bug in the compiler. However, I've not read everything in detail, so it might be a completely different issue.

Happy coding, Johannes

Danny_L_ · ‎01-12-2018

Good to see that link which indeed appears similar. I am already happy with this workaround for now.

mecej4 · ‎01-12-2018

Danny L. wrote:

Thanks mecej4, but v and w are set in the call to solver_revcom so at line 83 they are perfectly well defined and associated.

Only in certain cases. Consider the case when bicgstab%jmp = 2 when Subroutine bicgstab_revcom is called. Because v and w are declared INTENT(OUT), they become undefined when the subroutine is entered (the compiler is not required to make this happen, however). The section of the code corresponding to bicgstab%jmp = 2 sets some components of bicgstab and cmd, but v and w are not set before RETURN.

I have long felt that this aspect of INTENT(OUT), that is, making the variable undefined even if it had a perfectly good value before subprogram entry and was never touched before returning, is counter-intuitive and an unpleasant surprise to new users of Fortran 90+. The rule is: "If you declare INTENT(OUT), you must define the variable before leaving the subroutine." I find it helpful to tell myself, "Intent(Out) can mean Intent(Destroy)".
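A minimal sketch of the INTENT(INOUT) alternative in a reverse-communication style routine follows. The names revcom_sketch, step, work and demo are hypothetical and are not taken from the attached sources; the point is only that a branch which returns without touching the pointer leaves its earlier association intact when the dummy is INTENT(INOUT).

```fortran
module revcom_sketch
  implicit none
contains
  ! Hypothetical stand-in for one reverse-communication step.
  subroutine step(jmp, work, v)
    integer, intent(in)          :: jmp
    real, target, intent(inout)  :: work(:)
    real, pointer, intent(inout) :: v(:)   ! INOUT: association status survives entry

    if (jmp == 1) then
      v => work      ! this branch hands a vector back to the caller
    end if
    ! jmp == 2: v is not assigned here; with INTENT(INOUT) it simply
    ! keeps whatever association it had on entry.
  end subroutine step
end module revcom_sketch

program demo
  use revcom_sketch
  implicit none
  real, target  :: work(3)
  real, pointer :: v(:)

  work = [1.0, 2.0, 3.0]
  nullify(v)             ! give v a defined status before the first call

  call step(1, work, v)  ! associates v with work
  call step(2, work, v)  ! returns without touching v

  if (associated(v, work)) print *, 'v still points at work:', v
end program demo
```

With INTENT(OUT) on v instead, the second call would leave the association status of v undefined, so the test in the main program would no longer be valid.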

Danny_L_ · ‎01-12-2018

I see what you mean, mecej4, but in that case the TEST_ERROR section in solve_kkt is executed and v and w are not used. So this issue never really occurs. But you make a valid point about neatness. I think this is related to using reverse communication structures like this.

mecej4 · ‎01-12-2018

Danny, the risk is that even when v and w are not used, if they are subprogram arguments with INTENT(OUT), they may become undefined. I put together a test program to illustrate this point.

program xintent
!
! illustrate the effect of INTENT(OUT)
!
  implicit none
  integer, pointer :: ip(:)
  integer, target  :: i(2)
!
  i = 2
  ip => i

  call sub(ip, i)  ! ip is associated and initialized before call

  print *, ip      ! not valid, since ip became undefined at entry to SUB
  stop

contains

  subroutine sub(ip, i)
    implicit none
    integer, intent(in), target   :: i(:)
    integer, pointer, intent(out) :: ip(:)

    if (any(i == 5)) ip => i   ! pointer assignment not executed since no element of i equals 5

    return
  end subroutine
end program

The trouble with such programs is that few compilers help you to catch the bug related to IP becoming undefined merely because the subroutine was entered.
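One way to keep INTENT(OUT) and still stay within the rules is to give the pointer a defined status on every path, for example by nullifying it at entry and letting the caller test it. The following is only a sketch of that idea applied to the test program above; the program name xintent_fixed is made up.

```fortran
program xintent_fixed
  implicit none
  integer, pointer :: ip(:)
  integer, target  :: i(2)

  i = 2
  ip => i

  call sub(ip, i)

  ! ip now has a defined association status, so it is safe to test it.
  if (associated(ip)) then
    print *, ip
  else
    print *, 'ip is not associated after the call'
  end if

contains

  subroutine sub(ip, i)
    integer, intent(in), target   :: i(:)
    integer, pointer, intent(out) :: ip(:)

    nullify(ip)                ! define the status of ip on every path
    if (any(i == 5)) ip => i   ! not executed here, since no element of i equals 5
  end subroutine sub
end program xintent_fixed
```

Since no element of i equals 5, the caller takes the "not associated" branch instead of printing through an undefined pointer.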

Danny_L_ · ‎01-12-2018

I see your point mecej4