Different compiler output for same source. Cause not clear

Charles_S_5 · ‎10-28-2015

Hi,

Recently I built an Intel Visual Fortran project and found at a certain point in the program (at the moment a LEAVE asm statement is called) many variables become corrupted. I could not see why this was happening as I had made no changes to the source.

Eventually I disassembled a previous build of the same source and compared the assembly to the current build. I noticed some MOV EAX,nn instructions in the working version were replaced with JMP instructions in the broken version. I have attached a screenshot of these differences.

I would appreciate any advice as to why this is happening because I can find no other cause for the issue I am experiencing.

Edit: The binaries were decompiled with OllyDBG.

andrew_4619 · ‎10-29-2015

What compiler version are you using?

Have you changed compiler version? Have you changed compiler options?

I think you need to be running the program in the debugger. The most likely cause of the corruption is a bug in the source.

jimdempseyatthecove · ‎10-29-2015

The LEAVE instruction is similar to the RETN in that both return from a function call. The LEAVE instruction is used when the function was compiled (intended) to use the ENTER instruction as opposed to the CALL instruction.

Generally an application uses one calling convention or the other, not both. What are the two compiler options?

Can you also show the statements relating to your screenshot?

Are you looking at the local values inside the subroutine returned? (out of scope)

Are you looking at changes in variables whose references are now out of scope?

Jim Dempsey

mecej4 · ‎10-29-2015

When instructions such as RET nn, LEAVE, etc., are executed, the stack pointer ESP is altered. Therefore, local variables, which are usually allocated on the stack, go out of scope and a symbolic debugger may display junk values for those variables. You should refrain from drawing conclusions about what your program is doing based on what you see at the assembly level, especially if you are debugging optimized code. One exception is in the context of locating a suspected compiler bug.

Charles_S_5 · ‎10-29-2015

I'm running Parallel Studio XE 2016 Composer Edition downloaded on 16th October 2015.
I don't believe I changed any compiler options between the two builds.
The variables are defined in a module which is then USEd in related code.
This doesn't appear to be a scoping issue. Prior to a CALL statement the variable value is correct. After the LEAVE statement it is corrupt (changed).
I actually noticed this in the Visual Studio debugger and only after there was no clear cause did I step into the assembly, which allowed me to narrow this down to the LEAVE call and eventually the RET/MOV assembly differences.
I don't think I can show the statements corresponding to this assembly. These RET/MOV differences occur very early in the disassembly and I can't correlate them to a method in my source.

I was able to briefly fix this by changing some "LOGICAL" declarations to "INTEGER". However, when I then deleted all compiled output, restored the source code and did the exact same thing, it didn't work the second time. This leaves me very confused.

Compiler options:

/nologo /debug:full /Od /I"Includes" /Qinit:zero /Qinit:arrays /module:"Debug\\" /object:"Debug\\" /Fd"Debug\vc140.pdb" /traceback /check:bounds /check:stack /libs:static /threads /c

Linker options:

/OUT:"Debug\redacted.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Debug\redacted.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"C:\Users\redacted\Debug\redacted.pdb" /SUBSYSTEM:CONSOLE,"5.01" /IMPLIB:"C:\Users\redacted\Debug\redacted.lib" F90SQL.lib

jimdempseyatthecove · ‎10-29-2015

Look at the assembler code just prior to the CALL. If you do not see ENTER, when the called routine is using LEAVE, then the calling program/routine and the called routine are likely not using the same calling convention.

Jim Dempsey

Charles_S_5 · ‎10-29-2015

Jim, thanks for your message. I do not see any ENTER calls. Does the below code give any indication of the calling convention I should be using? I'm not familiar with the different calling conventions.

Edit: I just tried every calling convention option and the corruption was still present in all modes, with the exception of STDCALL, which caused link errors (unresolved external symbol _executestatement@16).

Example calling code:

CALL GetRows(StmtHndl,iRet)
01049B1E  add         esp,0FFFFFFF8h  
01049B21  lea         eax,[STMTHNDL]  
01049B27  mov         dword ptr [esp],eax  
01049B2A  lea         eax,[IRET]  
01049B30  mov         dword ptr [esp+4],eax  
01049B34  call        GETROWS (010606CCh)  
01049B39  add         esp,8

Example called code (memory corruption occurs after the LEAVE statement towards the end):

SUBROUTINE GetRows(StmtHndl,iRet)
010606CC  push        ebp  
010606CD  mov         ebp,esp  
010606CF  sub         esp,3Ch  
010606D2  push        eax  
010606D3  push        edi  
010606D4  push        ecx  
010606D5  mov         edi,ebp  
010606D7  sub         edi,3Ch  
010606DA  mov         ecx,0Fh  
010606DF  mov         eax,0CCCCCCCCh  
010606E4  rep stos    dword ptr es:[edi]  
010606E6  pop         ecx  
010606E7  pop         edi  
010606E8  pop         eax  
	use f90SQLConstants
	use f90SQLStructures
	use f90SQL
    
	integer(SQLHSTMT_KIND):: StmtHndl
	integer(SQLRETURN_KIND)::iRet
    
    call f90SQLFetchScroll(StmtHndl,SQL_FETCH_NEXT, int(0,SQLINTEGER_KIND),iRet)
010606E9  mov         eax,dword ptr [STMTHNDL]  
010606EC  mov         dword ptr [esp],eax  
010606EF  mov         dword ptr [esp+4],114C460h  
010606F7  mov         dword ptr [esp+8],114C45Ch  
010606FF  mov         eax,dword ptr [IRET]  
01060702  mov         dword ptr [esp+0Ch],eax  
01060706  call        _F90SQLFETCHSCROLL@16 (010653EAh)  
    
	Return
0106070B  mov         dword ptr [ebp-4],0  
	End
01060712  mov         eax,114C440h  
01060717  mov         edx,eax  
01060719  mov         ecx,ebp  
0106071B  push        edx  
	End
0106071C  push        eax  
0106071D  call        _RTC_CheckStackVars (010F6798h)  
01060722  pop         eax  
01060723  pop         edx  
01060724  add         esp,2Ch  
01060727  cmp         ebp,esp  
01060729  call        _RTC_CheckEsp (010F67F0h)  
0106072E  leave  
0106072F  ret

Charles_S_5 · ‎10-29-2015

If I take a working build, delete the compiled executable and relink the existing obj files, a broken build is produced. This seems to suggest the problem is linker-related. However, I don't have enough insight into Intel Visual Fortran to find the root cause of the issue. Any further help is appreciated.

mecej4 · ‎10-29-2015

Please give us a complete, self-contained reproducer. Looking solely at fragments of disassembled code is not going to help us get much done.

Charles_S_5 · ‎10-29-2015

That may be a bit difficult, since this software connects to a database and processes proprietary data. However, I can pull further code samples with appropriate redactions. I hope that having narrowed it down to the linker might shed some light on the issue. I will keep digging into this but eagerly await any further advice.

mecej4 · ‎10-29-2015

The redacted code does not have to do anything useful. It can produce silly results and do useless calculations, since its sole purpose is to help find a bug in the compiler. Try commenting out the database calls (ODBC function calls? embedded SQL?) and substituting fake data for your proprietary data.

You can see an example of how the process works at http://forums.silverfrost.com/viewtopic.php?t=2465&postdays=0&postorder=asc&start=0 .

Charles_S_5 · ‎10-29-2015

The variable corruption doesn't occur if I comment out the database calls since they comprise almost the entire application. However I hope there may be others on this forum who can look at the data I have posted and provide some insight before going down that route. The dozen or so MOV / JMP instructions are literally the only difference between a working and faulty executable.

IanH · ‎10-29-2015

A few points:

- How are you determining that the variables are corrupted? If you are just using the debugger, be aware that it can get confused. Confirm what the debugger is telling you by writing values to a file or similar. Note mecej4's points in post #4 if you are watching variable values instruction by instruction - you should expect strangeness when instructions (they are not statements) change the stack pointer, because the stack pointer is one of the means by which the debugger knows where the variable is in memory.

- I don't see anything in the options listed in #5 that would enable the stdcall calling convention, yet there are clearly stdcall procedures being invoked (from the @nn suffix on the symbol names). The disassembly in the opening post also looks like stdcall, with the RET nn instructions. The GetRows procedure in #7 appears to be cdecl, but then it calls stdcall procedures. As long as both caller scope and callee are clear on the calling convention in use, you can quite happily have a mix of calling conventions in a program, but if caller and callee don't agree, then many things can go astray. (It is possible a mix of calling conventions may also confuse the debugger.) CVF and IVF have different defaults in this regard, and different compiler options would also change the default convention, so be mindful of leftover bits from a previous build where things may have been different.

- LEAVE and ENTER are instructions executed as part of the function prologue and epilogue, which establish and then tear down the stack frame. Their effect may be accomplished, perhaps more often than not, by a series of "simpler" instructions (because the equivalent series of instructions is often faster!), hence you may well see one without seeing the other. They are not replacements for CALL and RET.

- Code that requires /Qinit:zero is code that hasn't done a proper job of initially defining variables. The code should be made to go and sit in the corner until it initially defines its variables correctly.
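The first point above (confirm what the debugger shows by writing values to a file) could look roughly like the sketch below. The variable name `row_ok`, the routine `suspect_routine` and the file name `watch.log` are all hypothetical placeholders, not names from the original program.

```fortran
program log_watch
  implicit none
  logical :: row_ok      ! hypothetical stand-in for the suspect variable
  integer :: log_unit

  open(newunit=log_unit, file='watch.log', status='replace', action='write')

  row_ok = .false.
  write(log_unit, '(a,l1)') 'before call: row_ok = ', row_ok
  flush(log_unit)        ! make sure the line reaches the file even if the call crashes

  call suspect_routine() ! hypothetical; replace with the real call under suspicion

  write(log_unit, '(a,l1)') 'after call:  row_ok = ', row_ok
  close(log_unit)

contains

  subroutine suspect_routine()
    ! Placeholder body; in the real program this would be the database call.
  end subroutine suspect_routine

end program log_watch
```

If the logged values differ from what the debugger displays, the debugger is the one that is confused; if they agree, the variable really is being modified.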

Charles_S_5 · ‎10-29-2015

In regards to how I determined the variable was corrupt - initially I ran the program and noticed it was entering an IF block based on a variable which should have been false. I then stepped through the code in Visual Studio and noticed the variable was changing after returning from a function. Since this made no sense, I went further with the disassembly and narrowed it down to the LEAVE instruction. All inspections of the variable were done at an appropriate scope where the variable should have been valid.

jimdempseyatthecove · ‎10-30-2015

Try this

Prior to making the call to the database function/subroutine that produces the symptom:

a) Set a Watch to the variable that changes
b) Open a Memory window and set the memory window to the same variable. Note you may need to use LOC(variableNameHere)

Now step over the call

If the Watch window changes and the Memory window does not, then your calling convention to the database is incorrect. This indicates the stack frame was not restored properly across the call.

If both change, then:

a) the calling conventions between the Fortran side and the database side are likely not agreeing on reference or value
b) the database is expecting a C-style NULL terminated string. Fortran does not use NULL terminated strings.
c) The size of an output array (passed to the database) may be incorrect or misunderstood (this may be a result of a) where the reference of the size of the array is passed as opposed to the value of the size of the array)

There may be others

Jim Dempsey
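As a rough illustration of the LOC(variableNameHere) hint above, the address can also be printed from the program itself and then typed into the Memory window. This is only a sketch: `row_ok` is a hypothetical variable name, and LOC is a non-standard extension (Intel Fortran and gfortran both provide it).

```fortran
program show_address
  implicit none
  logical :: row_ok   ! hypothetical stand-in for the variable being watched

  row_ok = .false.

  ! LOC is a non-standard extension that returns the address of its
  ! argument as an integer; Z0 prints it in hexadecimal, minimal width.
  print '(a,z0)', 'address of row_ok (hex): ', loc(row_ok)
end program show_address
```

The printed address will differ from run to run and from build to build, so it needs to be read off in the same debug session in which the Memory window is used.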

JVanB · ‎10-30-2015

The stuff about the LEAVE instruction is a bunch of hogwash. See http://support.amd.com/TechDocs/40546.pdf , section 4.9. LEAVE is just being used as a way to clean up the frame pointer; no ENTER is required or generally desirable.

Charles_S_5 · ‎11-01-2015

Jim,

Your advice about observing the state in a Watch window and a Memory window was key to solving this problem.

Once I had the Memory window open I saw that a large area of memory was being overwritten by a long sequence of integers. I also noticed this happened immediately after f90SQLFetchScroll was called, as opposed to the LEAVE statement as I had observed earlier. With that information I discovered that an array which tracks the status of the returned database rows was being overrun. In other words it was a simple programming error exactly as you predicted.

The solution was to change the row status array of kind SQLUSMALLINT_KIND to SQLUINTEGER_KIND. I believe this issue occurred because the application was built for a very old ODBC driver which returned unsigned shorts while a modern ODBC driver, such as the one on my workstation, returns unsigned integers.
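The size mismatch described above can be made concrete with a little arithmetic. The sketch below assumes that SQLUSMALLINT_KIND corresponds to a 16-bit integer and SQLUINTEGER_KIND to a 32-bit integer (an assumption, not something taken from the f90SQL documentation), and uses a hypothetical rowset size of 100.

```fortran
program row_status_size
  use iso_fortran_env, only: int16, int32
  implicit none
  integer, parameter :: n_rows = 100        ! hypothetical rowset size
  integer(int16) :: status_small(n_rows)    ! 16-bit elements: what the caller declared
  integer(int32) :: status_wide(n_rows)     ! 32-bit elements: what the writer assumed
  integer :: bytes_reserved, bytes_written

  ! STORAGE_SIZE returns the size of one array element in bits.
  bytes_reserved = n_rows * storage_size(status_small) / 8
  bytes_written  = n_rows * storage_size(status_wide)  / 8

  print '(a,i0)', 'bytes reserved by caller: ', bytes_reserved
  print '(a,i0)', 'bytes written by callee:  ', bytes_written
  print '(a,i0)', 'bytes past end of array:  ', bytes_written - bytes_reserved
end program row_status_size
```

With these assumed sizes the caller reserves 200 bytes while 400 are written, so 200 bytes of whatever happens to follow the array in memory get overwritten.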

I don't know why the only difference between a working and broken build was the MOV/JMP statements. I also don't know how I was initially able to produce a working build without having fixed this bug. I'll need to do some further investigation to make sense of that and to confirm that the problem is fixed. I hope this resolution helps anyone facing a similar issue when migrating f90SQL to a modern machine.

To all,

I appreciate your guidance and your patience. I wouldn't have figured this one out without your insight.

Regards,

Charles

jimdempseyatthecove · ‎11-01-2015

Glad to help. It is good that you were able to find the "There may be others".

RE: "I also don't know how I was initially able to produce a working build without having fixed this bug"

There is a difference between a build that runs and a build that runs correctly.

Jim Dempsey

andrew_4619 · ‎11-01-2015

Charles S. wrote:
I also don't know how I was initially able to produce a working build without having fixed this bug.

As Jim intimates (and from my own experience), often from one build to another with some small changes the memory organisation is different, so you overwrite different stuff (non-critical, or already used and no longer needed) and the program can "run correctly".

Does f90SQL not have USE module files for interface checking BTW? That would have flagged the error.
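For readers unfamiliar with interface checking, here is a minimal sketch of the idea. The module `fake_db` and the routine `fetch_status` are hypothetical stand-ins and are not part of f90SQL; whether the real f90SQL interfaces would have caught this particular mismatch depends on how the status array is passed to the library.

```fortran
module fake_db                        ! hypothetical stand-in, not part of f90SQL
  use iso_fortran_env, only: int32
  implicit none
contains
  subroutine fetch_status(status)
    ! The explicit interface records that 32-bit elements are expected.
    integer(int32), intent(out) :: status(:)
    status = 0_int32
  end subroutine fetch_status
end module fake_db

program interface_check
  use iso_fortran_env, only: int32
  use fake_db
  implicit none
  integer(int32) :: good(10)
  ! integer(int16) :: bad(10)         ! would also need int16 from iso_fortran_env

  call fetch_status(good)             ! compiles: the kinds match
  ! call fetch_status(bad)            ! rejected at compile time: kind mismatch

  print '(a,i0)', 'elements set: ', size(good)
end program interface_check
```

Because the routine lives in a module, the compiler sees its interface at every call site and can reject an argument whose kind does not match, instead of letting the mismatch surface at run time as memory corruption.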