Dear all, we recently moved to version 2018 of the Fortran compiler. We found one instance where we were not able to figure out why we encounter a problem with the 2018 version. The code this happens in is very big, so I went through a procedure to strip this code down to the bare essentials. I post it here to get some feedback whether you think there may be another problem. If not I will submit a bug report. I have attached a tar file with the source files (a few small ones), a Makefile and a script (adapt.sh) that compiles the code and then runs it 4 times.
The correct result obtained with the 2017u4 version is attached (output_2017u4.txt). The output of the 2018 version is in output_2018u1. The code not only segfaults every single time, it also gives inconsistent results and it won't give a proper traceback.
For all curious people, this sample code solves a linear system using a bicgstab Krylov solver using reverse communication.
Thanks for all your advice in advance.
Line 83 of KKT.f90 refers to a dangling pointer. At that location, v and w are undefined and unassociated. When a program contains undefined variables, run time behavior is also "undefined".
To add something else: I just tested gfortran on my Mac, and it gives the same result as the 2017 Intel compiler. This may strengthen my suspicion that something is fishy in the 2018 version. I realise the code is bigger than is ideal for a quick look, but I think the call to solver_revcom is probably essential in generating the bug.
Hi Danny, just being curious I tested your code on Windows OS with 16.0.3, 17.0.4, 18.0.0 and 18.0.1 as x64 debug builds as in your makefile.
All compiler versions generate a floating point overflow:
forrtl: error (72): floating overflow
Image              PC                Routine            Line     Source
adapt_code.exe     000000013F052CF1  BICGSTAB_mp_BICGS        82  bicgstab.f90
adapt_code.exe     000000013F05A9C3  SOLVER_mp_SOLVER_        46  solver.f90
adapt_code.exe     000000013F05BBFC  KKT_mp_SOLVE_KKT         51  kkt.f90
adapt_code.exe     000000013F05E2D4  MAIN__                   17  main.f90
adapt_code.exe     000000013F0FB952  Unknown             Unknown  Unknown
adapt_code.exe     000000013F0FC5B4  Unknown             Unknown  Unknown
adapt_code.exe     000000013F0FC4C7  Unknown             Unknown  Unknown
adapt_code.exe     000000013F0FC38E  Unknown             Unknown  Unknown
adapt_code.exe     000000013F0FC5C9  Unknown             Unknown  Unknown
kernel32.dll       0000000076B859CD  Unknown             Unknown  Unknown
ntdll.dll          0000000076DBA561  Unknown             Unknown  Unknown
rnorm = sqrt(dot_product(bicgstab%r,bicgstab%r)) ! bicgstab%r(1:) = 1.9983972E+18
Adding -real_size:64, which promotes single precision to double precision, avoids the overflow error. 18.0.1 then breaks without generating traceback information, while 16.0.3 and 17.0.4 run fine. In addition, when stepping line by line in the debugger, the 18 compiler family does not let me step through the lines in init_bicgstab (lines 183...194 in file bicgstab). Whatever this means.
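To make the overflow plausible, here is a minimal sketch of the arithmetic. It assumes that r really holds values of about 1.9983972E+18 and has roughly a hundred elements; the length n = 100 is hypothetical, since the real vector length is not stated in this thread. The sum of squares is accumulated in double precision and compared against the largest single-precision number, so the sketch itself does not trap.

```fortran
program overflow_sketch
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  integer, parameter :: n = 100          ! hypothetical length, not from the real code
  real(real32) :: r(n)
  real(real64) :: s

  r = 1.9983972e18_real32                ! magnitude reported for bicgstab%r
  s = sum(real(r, real64)**2)            ! accumulate in double to avoid the overflow

  print *, 'sum of squares (double)  :', s
  print *, 'largest single value     :', huge(1.0_real32)
  print *, 'would overflow in single :', s > real(huge(1.0_real32), real64)
  print *, 'norm computed in double  :', sqrt(s)
end program overflow_sketch
```

Each squared element is about 4E+36, so under these assumptions the single-precision sum passes huge(1.0) after fewer than a hundred terms, while the same computation in double precision stays far from its limit.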
Hopefully not another regression of PSXE18! You had better file this issue at OSC.
Hi, for completeness here are the terminal outputs for the first run:
PSXE 17 update 4
KKT solver step 0 error 1.00000E+00 conv_rate 1.00000E+00
KKT solver step 1 error 6.58145E+00 conv_rate 6.58145E+00
[solve_kkt] group iteration did not converge within 1 steps
PSXE 18 update 1
KKT solver step 0 error 1.45682+144 conv_rate 1.00000E+00
The initial error is somehow corrupted with 18 on Windows. After this the program breaks with a debug error window. Hopefully this helps somehow.
Hello Johannes, Thanks for your comments. I missed that you commented so did not reply earlier. In fact we have now been able to solve this issue. We had to change the pointer declarations of v and w in solver_revcom and in bicgstab_revcom to intent(inout) instead of intent(out). Not sure if I understand this, but apparently pointer cleanup is different in 2018 (and possibly flawed?). Anyway we can get back to work :-)
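A minimal sketch of this kind of change, not the actual solver code: the module, the routine step_revcom, and the constants CMD_MATVEC and CMD_DONE are hypothetical names made up for illustration. The point it tries to show is that with intent(inout) the pointers keep their association across calls in which the routine does not touch them.

```fortran
module revcom_sketch
  implicit none
  private
  public :: step_revcom, CMD_MATVEC, CMD_DONE
  integer, parameter :: CMD_MATVEC = 1, CMD_DONE = 2
contains
  ! Hypothetical reverse-communication step (not the real bicgstab_revcom).
  ! v and w are intent(inout), so their association status is preserved
  ! on calls where this routine does not reassign them.
  subroutine step_revcom(jmp, x, y, v, w, cmd)
    integer, intent(inout)       :: jmp
    real, target, intent(inout)  :: x(:), y(:)
    real, pointer, intent(inout) :: v(:), w(:)
    integer, intent(out)         :: cmd
    select case (jmp)
    case (1)
      v => x                 ! ask the caller to compute w = A*v
      w => y
      cmd = CMD_MATVEC
      jmp = 2
    case default
      cmd = CMD_DONE         ! v and w are deliberately left untouched here
    end select
  end subroutine step_revcom
end module revcom_sketch

program revcom_demo
  use revcom_sketch
  implicit none
  real, target  :: x(3), y(3)
  real, pointer :: v(:), w(:)
  integer :: jmp, cmd

  x = 1.0
  y = 0.0
  nullify(v, w)
  jmp = 1
  do
    call step_revcom(jmp, x, y, v, w, cmd)
    if (cmd == CMD_DONE) exit
    w = 2.0 * v              ! stand-in for the matrix-vector product
  end do
  ! v and w are still associated after the final call
  print *, associated(v, x), associated(w, y)
  print *, y
end program revcom_demo
```

With intent(out) instead, the second call would leave v and w with undefined association status, because the default branch never assigns them.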
Best regards, Danny
Hi Danny, good to hear that you have solved your issue. Sometimes newer compilers are more restrictive. Maybe the standard leaves room to handle pointers in different ways and the 18 series chose a different approach. But that is just speculation.
I vaguely remember a discussion on intent use with pointers on comp.lang.fortran:
Maybe you will find useful information there. The original poster could also solve his issue by changing the intent from out to inout. Steve Lionel mentioned that this is a bug in the compiler. However, I have not read everything in detail, so it might be a completely different issue.
Happy coding, Johannes
Danny L. wrote:
Thanks mecej4, but v and w are set in the call to solver_revcom so at line 83 they are perfectly well defined and associated.
Only in certain cases. Consider the case when bicgstab%jmp = 2 when Subroutine bicgstab_revcom is called. Because v and w are declared INTENT(OUT), they become undefined when the subroutine is entered (the compiler is not required to make this happen, however). The section of the code corresponding to bicgstab%jmp = 2 sets some components of bicgstab and cmd, but v and w are not set before RETURN.
I have long felt that this aspect of INTENT(OUT), that is, making the variable undefined even if it had a perfectly good value before subprogram entry and was never touched before returning, is counter-intuitive and an unpleasant surprise to new users of Fortran 90+. The rule is: "If you declare INTENT(OUT), you must define the variable before leaving the subroutine." I find it helpful to tell myself, "Intent(Out) can mean Intent(Destroy)".
I see what you mean mecej4, but in that case the TEST_ERROR section in solve_kkt is executed and the v and w are not used. So, this issue really never occurs. But you make a valid point of neatness. I think this is related to using reverse communication structures like this.
Danny, the risk is that even when v and w are not used, if they are subprogram arguments with INTENT(OUT), they may become undefined. I put together a test program to illustrate this point.
program xintent
!
! illustrate the effect of INTENT(OUT)
!
implicit none
integer, pointer :: ip(:)
integer, target :: i(2)
!
i = 2
ip=>i
call sub(ip,i) ! ip is associated and initialized before call
print*, ip     ! not valid, since ip became undefined at entry to SUB
stop
contains
subroutine sub(ip,i)
implicit none
integer, intent(in),target :: i(:)
integer, pointer, intent(out) :: ip(:)
if(any(i == 5))ip=>i ! pointer assignment not executed since no element of i equals 5
return
end subroutine
end program
The trouble with such programs is that few compilers help you to catch the bug related to IP becoming undefined merely because the subroutine was entered.
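One defensive pattern, sketched here as a variation on the test program above and offered as a suggestion rather than anything taken from Danny's code, is to give an INTENT(OUT) pointer a defined status as the first statement of the subroutine and to test it with ASSOCIATED in the caller. The program name xintent_checked is made up for this sketch.

```fortran
program xintent_checked
  implicit none
  integer, pointer :: ip(:)
  integer, target  :: i(2)

  i = 2
  ip => i
  call sub(ip, i)
  ! ip now has a defined status, so it is legal to test it
  if (associated(ip)) then
    print *, ip
  else
    print *, 'ip was not set by sub'
  end if
contains
  subroutine sub(ip, i)
    integer, intent(in), target    :: i(:)
    integer, pointer, intent(out)  :: ip(:)
    ip => null()               ! define the pointer status immediately on entry
    if (any(i == 5)) ip => i   ! not executed here, no element of i equals 5
  end subroutine sub
end program xintent_checked
```

This does not preserve the association the pointer had before the call (INTENT(INOUT) is needed for that), but it turns an undefined pointer into a disassociated one that the caller can check for.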