Hi,
Recently I built an Intel Visual Fortran project and found that at a certain point in the program (the moment a LEAVE instruction is executed) many variables become corrupted. I could not see why this was happening, as I had made no changes to the source.
Eventually I disassembled a previous build of the same source and compared the assembly to the current build. I noticed that some MOV EAX,nn instructions in the working version were replaced with JMP instructions in the broken version. I have attached a screenshot of these differences.
I would appreciate any advice as to why this is happening, because I can find no other cause for the issue I am experiencing.
Edit: The binaries were disassembled with OllyDBG.
What compiler version are you using?
Have you changed compiler version? Have you changed compiler options?
I think you need to be running the program in the debugger. The most likely cause of the corruption is a bug in the source.
The LEAVE instruction is similar to the RETN in that both return from a function call. The LEAVE instruction is used when the function was compiled (intended) to use the ENTER instruction as opposed to the CALL instruction.
Generally an application uses one calling convention or the other, not both. What are the two compiler options?
Can you also show the statements relating to your screenshot.
Are you looking at the local values inside the subroutine after it has returned? (out of scope)
Are you looking at changes in variables whose references are now out of scope?
Jim Dempsey
When instructions such as RET nn, LEAVE, etc., are executed, the stack pointer ESP is altered. Therefore, local variables, which are usually allocated on the stack, go out of scope and a symbolic debugger may display junk values for those variables. You should refrain from drawing conclusions about what your program is doing based on what you see at the assembly level, especially if you are debugging optimized code. One exception is in the context of locating a suspected compiler bug.
- I'm running Parallel Studio XE 2016 Composer Edition downloaded on 16th October 2015.
- I don't believe I changed any compiler options between the two builds.
- The variables are defined in a module which is then USEd in related code.
- This doesn't appear to be a scoping issue. Prior to a CALL statement the variable value is correct. After the LEAVE statement it is corrupt (changed).
- I actually noticed this in the Visual Studio debugger and only after there was no clear cause did I step into the assembly, which allowed me to narrow this down to the LEAVE call and eventually the RET/MOV assembly differences.
- I don't think I can show the statements corresponding to this assembly. This RET/MOV differences occur very early in the disassembly and I can't correlate it to a method in my source.
I was able to briefly fix this by changing some "LOGICAL" declarations to "INTEGER". However, when I then deleted all compiled output, restored the source code and did the exact same thing, it didn't work the second time. This leaves me very confused.
Compiler options:
/nologo /debug:full /Od /I"Includes" /Qinit:zero /Qinit:arrays /module:"Debug\\" /object:"Debug\\" /Fd"Debug\vc140.pdb" /traceback /check:bounds /check:stack /libs:static /threads /c
Linker options:
/OUT:"Debug\redacted.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Debug\redacted.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"C:\Users\redacted\Debug\redacted.pdb" /SUBSYSTEM:CONSOLE,"5.01" /IMPLIB:"C:\Users\redacted\Debug\redacted.lib" F90SQL.lib
Look at the assembler code just prior to the CALL. If you do not see ENTER, when the called routine is using LEAVE, then the calling program/routine and the called routine are likely not using the same calling convention.
Jim Dempsey
Jim, thanks for your message. I do not see any ENTER calls. Does the below code give any indication of the calling convention I should be using? I'm not familiar with the different calling conventions.
Edit: I just tried every calling convention option and the corruption was still present in all modes, with the exception of STDCALL which caused compiler errors (unresolved external symbol _executestatement@16)
Example calling code:
CALL GetRows(StmtHndl,iRet)
  01049B1E add esp,0FFFFFFF8h
  01049B21 lea eax,[STMTHNDL]
  01049B27 mov dword ptr [esp],eax
  01049B2A lea eax,[IRET]
  01049B30 mov dword ptr [esp+4],eax
  01049B34 call GETROWS (010606CCh)
  01049B39 add esp,8
Example called code (memory corruption occurs after the LEAVE statement towards the end):
SUBROUTINE GetRows(StmtHndl,iRet)
  010606CC push ebp
  010606CD mov ebp,esp
  010606CF sub esp,3Ch
  010606D2 push eax
  010606D3 push edi
  010606D4 push ecx
  010606D5 mov edi,ebp
  010606D7 sub edi,3Ch
  010606DA mov ecx,0Fh
  010606DF mov eax,0CCCCCCCCh
  010606E4 rep stos dword ptr es:[edi]
  010606E6 pop ecx
  010606E7 pop edi
  010606E8 pop eax
use f90SQLConstants
use f90SQLStructures
use f90SQL
integer(SQLHSTMT_KIND):: StmtHndl
integer(SQLRETURN_KIND)::iRet
call f90SQLFetchScroll(StmtHndl,SQL_FETCH_NEXT, int(0,SQLINTEGER_KIND),iRet)
  010606E9 mov eax,dword ptr [STMTHNDL]
  010606EC mov dword ptr [esp],eax
  010606EF mov dword ptr [esp+4],114C460h
  010606F7 mov dword ptr [esp+8],114C45Ch
  010606FF mov eax,dword ptr [IRET]
  01060702 mov dword ptr [esp+0Ch],eax
  01060706 call _F90SQLFETCHSCROLL@16 (010653EAh)
Return
  0106070B mov dword ptr [ebp-4],0
End
  01060712 mov eax,114C440h
  01060717 mov edx,eax
  01060719 mov ecx,ebp
  0106071B push edx
End
  0106071C push eax
  0106071D call _RTC_CheckStackVars (010F6798h)
  01060722 pop eax
  01060723 pop edx
  01060724 add esp,2Ch
  01060727 cmp ebp,esp
  01060729 call _RTC_CheckEsp (010F67F0h)
  0106072E leave
  0106072F ret
If I take a working build, delete the compiled executable and relink the existing obj files, a broken build is produced. This seems to suggest the problem is linker-related. However, I don't have enough insight into Intel Visual Fortran to find the root cause of the issue. Any further help is appreciated.
Please give us a complete, self-contained reproducer. Looking solely at fragments of disassembled code is not going to help us get much done.
That may be a bit difficult, since this software connects to a database and processes proprietary data. However, I can pull further code samples with appropriate redactions. I hope that having narrowed it down to the linker might shed some light on the issue. I will keep digging into this but eagerly await any further advice.
The redacted code does not have to do anything useful. It can produce silly results and do useless calculations, since its sole purpose is to help find a bug in the compiler. Try commenting out the database calls (ODBC function calls? embedded SQL?) and substituting fake data for your proprietary data.
You can see an example of how the process works at http://forums.silverfrost.com/viewtopic.php?t=2465&postdays=0&postorder=asc&start=0 .
The variable corruption doesn't occur if I comment out the database calls since they comprise almost the entire application. However I hope there may be others on this forum who can look at the data I have posted and provide some insight before going down that route. The dozen or so MOV / JMP instructions are literally the only difference between a working and faulty executable.
A few points:
- How are you determining that the variables are corrupted? If you are just using the debugger, be aware that it can get confused. Confirm what the debugger is telling you by writing values to a file or similar. Note mecej4's points in post #4 if you are watching variable values instruction by instruction - you should expect strangeness when instructions (they are not statements) change the stack pointer, because the stack pointer is one of the means by which the debugger knows where the variable is in memory.
- I don't see anything in the options listed in #5 that would enable the stdcall calling convention, yet there are clearly stdcall procedures being invoked (from the @nn suffix on the symbol names). The disassembly in the opening post also looks like stdcall, with the RET nn instructions. The GetRows procedure in #7 appears to be cdecl, but then it calls stdcall procedures. As long as both the caller and the callee are clear on the calling convention in use, you can quite happily have a mix of calling conventions in a program, but if caller and callee don't agree, then many things can go astray. (It is possible a mix of calling conventions may also confuse the debugger.) CVF and IVF have different defaults in this regard, and different compiler options would also change the default convention, so be mindful of leftover bits from a previous build where things may have been different.
- LEAVE and ENTER are instructions executed as part of the function prologue and epilogue; they establish and then tear down the stack frame. Their effect may be accomplished, perhaps more often than not, by a series of "simpler" instructions (because the equivalent series of instructions is often faster!), hence you may well see one without seeing the other. They are not replacements for CALL and RET.
- Code that requires /Qinit:zero is code that hasn't done a proper job of initially defining variables. The code should be made to go and sit in the corner until it initially defines its variables correctly.
Regarding how I determined the variable was corrupt: initially I ran the program and noticed it was entering an IF block based on a variable which should have been false. I then stepped through the code in Visual Studio and noticed the variable was changing after returning from a function. Since this made no sense, I went further with the disassembly and narrowed it down to the LEAVE instruction. All inspections of the variable were done at an appropriate scope where the variable should have been valid.
Try this
Prior to calling the database function/subroutine that produces the symptom:
a) Set a Watch to the variable that changes
b) Open a Memory window and set the memory window to the same variable. Note you may need to use LOC(variableNameHere)
Now step over the call
If the Watch window changes and the Memory window does not, then your calling convention to the database is incorrect. This indicates the stack frame was not restored properly across the call.
If both change, then:
a) the calling conventions between the Fortran side and the database side are likely not agreeing on reference or value
b) the database is expecting a C-style NULL terminated string. Fortran does not use NULL terminated strings.
c) The size of an output array (passed to the database) may be incorrect or misunderstood (this may be a result of a) where the reference of the size of the array is passed as opposed to the value of the size of the array)
There may be others
Jim Dempsey
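The Watch/Memory comparison above can also be approximated without the debugger. What follows is only a minimal sketch under stated assumptions: `row_flag` and `suspect_call` are hypothetical stand-ins for the variable that changes and for the real database call, and `LOC` is a vendor extension (available in Intel Fortran and gfortran), not standard Fortran.

```fortran
module watched_state
  implicit none
  ! Hypothetical module variable standing in for the one that "changes".
  logical :: row_flag = .false.
end module watched_state

program watch_sketch
  use watched_state
  implicit none

  ! LOC is a non-standard extension; it returns the address of its argument.
  print '(a,z0,a,l1)', 'before: address=', loc(row_flag), ' value=', row_flag
  call suspect_call()   ! stand-in for the real database call
  print '(a,z0,a,l1)', 'after:  address=', loc(row_flag), ' value=', row_flag

contains

  subroutine suspect_call()
    ! Placeholder only: the real routine would call into the database library.
  end subroutine suspect_call

end program watch_sketch
```

The printed address is the one to enter in the Memory window. If the printed value differs before and after the call, the memory itself was modified, which would point away from a debugger display problem.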
The stuff about the LEAVE instruction is a bunch of hogwash. See http://support.amd.com/TechDocs/40546.pdf , section 4.9. LEAVE is just being used as a way to clean up the frame pointer; no ENTER is required or generally desirable.
Jim,
Your advice about observing the state in a Watch window and a Memory window was key to solving this problem.
Once I had the Memory window open I saw that a large area of memory was being overwritten by a long sequence of integers. I also noticed this happened immediately after f90SQLFetchScroll was called, as opposed to the LEAVE statement as I had observed earlier. With that information I discovered that an array which tracks the status of the returned database rows was being overrun. In other words it was a simple programming error exactly as you predicted.
The solution was to change the kind of the row status array from SQLUSMALLINT_KIND to SQLUINTEGER_KIND. I believe this issue occurred because the application was built for a very old ODBC driver which returned unsigned shorts, while a modern ODBC driver, such as the one on my workstation, returns unsigned integers.
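To illustrate why such a size mismatch corrupts neighbouring variables, here is a rough simulation rather than the actual f90SQL code. It assumes a made-up memory layout (an 8-byte status area for four rows, followed by a 4-byte flag and 4 bytes of other data) and a hypothetical driver that writes one 4-byte status of value 1 per row. A byte buffer is used so that the sketch itself never writes out of bounds.

```fortran
program overrun_sketch
  use iso_fortran_env, only: int8, int16, int32
  implicit none

  integer, parameter :: nrows = 4
  ! Simulated memory (assumed layout): bytes 1-8 are reserved for nrows
  ! 2-byte statuses, bytes 9-12 hold a flag, bytes 13-16 are other data.
  integer(int8)  :: memory(16)
  integer(int32) :: flag
  integer        :: i, first

  memory = 0_int8
  flag = transfer(memory(9:12), flag)
  print *, 'flag before fetch:', flag

  ! Hypothetical driver behaviour: one 4-byte status per row, although
  ! the caller reserved only 2 bytes per row.
  do i = 1, nrows
    first = 4*(i - 1) + 1
    memory(first:first + 3) = transfer(1_int32, memory, 4)
  end do

  flag = transfer(memory(9:12), flag)
  print *, 'flag after fetch: ', flag
  print *, 'bytes reserved:', nrows*storage_size(1_int16)/8, &
           ' bytes written:', nrows*storage_size(1_int32)/8
end program overrun_sketch
```

In this simulation the flag changes from 0 to 1 even though nothing assigned to it directly, because the third row's status lands on the bytes that follow the reserved area. Which variables get hit in a real program depends on how the compiler and linker happen to lay out memory.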
I don't know why the only difference between a working and broken build was the MOV/JMP statements. I also don't know how I was initially able to produce a working build without having fixed this bug. I'll need to do some further investigation to make sense of that and to confirm that the problem is fixed. I hope this resolution helps anyone facing a similar issue when migrating f90SQL to a modern machine.
To all,
I appreciate your guidance and your patience. I wouldn't have figured this one out without your insight.
Regards,
Charles
Glad to help. It is good that you were able to find the "There may be others".
RE: "I also don't know how I was initially able to produce a working build without having fixed this bug"
There is a difference between a build that runs and a build that runs correctly.
Jim Dempsey
Charles S. wrote:
I also don't know how I was initially able to produce a working build without having fixed this bug.
As Jim intimates (and from my own experience), often from one build to another with some small changes the memory organisation is different, so you overwrite different stuff (non-critical, or already used and no longer needed) and the program can "run correctly".
Does f90SQL not have USE module files for interface checking BTW? That would have flagged the error.
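As a sketch of what interface checking would look like in general: when a routine has an explicit interface (for example because it lives in a module), the compiler rejects an actual argument whose kind does not match the dummy argument. `fill_status` below is a hypothetical routine, not part of f90SQL, and the thread does not say whether the f90SQL modules declare this particular argument in a way that would have caught the mismatch.

```fortran
module fetch_sketch
  use iso_fortran_env, only: int16, int32
  implicit none
contains
  ! Hypothetical stand-in for a library routine that fills one
  ! 4-byte status per row.
  subroutine fill_status(status)
    integer(int32), intent(out) :: status(:)
    status = 0_int32
  end subroutine fill_status
end module fetch_sketch

program interface_sketch
  use fetch_sketch
  implicit none
  integer(int32) :: ok_status(4)
  integer(int16) :: short_status(4)

  call fill_status(ok_status)        ! kinds match: accepted
  ! call fill_status(short_status)   ! kind mismatch: rejected at compile time
  short_status = 0_int16
  print *, ok_status, short_status
end program interface_sketch
```

Uncommenting the second call should produce a compile-time diagnostic instead of a silent overrun at run time. Routines called through an implicit interface, or declared with a deliberately loose interface, get no such check.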