Omit Frame Pointers causing slowdown

groupw_bench
Beginner
I'm just getting up to speed with IVF 9.1, and ran into a problem. A fairly simple section of code involving moving some data between two arrays is exceedingly slow (more than 10 times slower than it should be) when compiled with double precision real variables. (This code is about 20 lines in a program of a few tens of thousands of lines of code. Took a while to isolate it. The same problem might be occurring elsewhere in the program, but this code is in a section where any slowing is very noticeable.) The same code works fine in single precision and in another project. By trial and error, I discovered that I could fix the problem by setting the Omit Frame Pointers optimization option to No (I'm using the Visual Studio 2005 environment). The same code has been in use for years, compiled with CVF 6.6. I've checked it very carefully and can't find anything wrong or extraordinary about it.

I've read the description of this optimization option several times, but for the life of me can't make any sense out of the explanation for what it does. The program with this option set to No runs about as fast as another one using similar code where it's set to Yes, so there doesn't seem to be much or any of a speed hit by setting it to No. Am I likely to run into related problems with the optimizer, and is there some real disadvantage to setting Omit Frame Pointers to No?

drgfthomas
Beginner

By default 'Omit Frame Pointers' is NO (love that double negative), meaning frame pointers are enabled. Frame pointers, together with incremental linking off, are required for Traceback. If frame pointers are off, the compiler has an additional register to work with, and this may boost computational speed.

jimdempseyatthecove
Honored Contributor III

Can you post your 20 lines? Include subroutine header and declared variables.

Make sure you are not running one configuration with run-time checks or array bounds checking enabled, or one with optimizations off. A run time ratio of 20:1 is indicative of these differences.

Omitting the frame pointer reduces your stack allocation by 4 bytes compared with including it. That can change the alignment of local variables, among other things that may affect the processor caches.

Also, other options may be affecting your code generation. I would suggest the following:

Create a configuration based on the "Debug" configuration (I use "DebugFast"). Then set optimizations to Max Speed and enable other optimizations to your "Release" settings.

a) Compile one way and set a breakpoint at the start of the 20-line code section. On break, open the disassembly window. You do not have to understand assembly code to do this. Then select the section of code in the disassembly window that represents the 20 lines of your source code. This may be hundreds of lines. Copy it to the clipboard (Ctrl-C). Open a new text file window and paste the text into it. Save the window (Save As, using an appropriate name such as "With FP" or "Without FP").

b) Compile the other way and set a breakpoint at the start of the 20-line code section. Select, copy, and paste into an additional new text window, and save it under the other name.

Now with both windows available select a Side-by-side pane view and examine for code differences. Ignore the hex address differences.

Now then,

1) If the number of statements is the same, this indicates that the performance issue is due to the memory alignment of local variables .OR. that the alignment of the working data arrays is now unfavorable for processor cache access. Consider using directives to force alignment of local variables that you identify as sensitive.

!dec$ attributes align : 16 :: TOSVX1

real, automatic :: TOSVX1(3)

2) If the number of statements is slightly different because of the extra register available in the "Without FP" build, then the problem is still likely the same as 1) above.

3) If the number of statements varies significantly, then suspect that different compiler options are in effect. Note that default options may be different with and without "Omit Frame Pointers". You may have to experiment with options to get what you want.

Also, remember that

If you are unable to resolve the performance issue using the above hints then consider using VTune (or other code profiler) to find the hot spot. This may lend a little more insight to the problem.

Jim Dempsey

groupw_bench
Beginner
Hm, I just bought and installed IVF v. 9.1, and in all the new projects, the option "Omit frame pointers" is set to Yes.

groupw_bench
Beginner
Thanks very much!

Here's the code:

      SUBROUTINE XX1 (A,D,NROW,IP,LD2)

      IMPLICIT NONE

      INTEGER (KIND = 4) :: NROW,LD2
      INTEGER (KIND = 4) :: IP(NROW)
      COMPLEX (KIND = 8) :: A(NROW, *),D(LD2)
      INTEGER (KIND = 4) :: I,J,J1,J2,IXJ,PJ,JP1,R,R1,R2
      COMPLEX (KIND = 8) :: AJR

      . . .
      DO R = R1, R2
      . . .
        IXJ=0

c ------- Problem block -------

        DO J = J1, J2
          IXJ=IXJ+1
          PJ=IP(J)
          AJR=D(PJ)
          A(J,R)=AJR
          D(PJ)=D(J)
          JP1=J+1
          DO I = JP1, NROW
            D(I)=D(I)-A(I,IXJ)*AJR
          END DO
        END DO

c -----------------------------
      . . .
      END DO
      . . .

The assembly code generated as you suggested, with Omit Frame Pointers = Yes and No seems identical for the two cases (from DO J = J1, J2 through the second END DO), with one exception. At the very beginning of the block, right after DO J = J1, J2, the "Yes" option has a single additional instruction:

00492467 mov eax,dword ptr [ebp-0Ch]

Is this significant, or should I proceed as though the two are identical?
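
One way to separate code-generation effects from data effects is to time the same loop nest in isolation on synthetic data under both option settings. The following is only a minimal sketch: the program name, the size n, the identity pivot vector, and the fill values are all assumptions made for illustration, and only the R = 1 pass of the outer loop is reproduced.

```fortran
program time_block
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 1000            ! hypothetical problem size
  complex(dp), allocatable :: a(:,:), d(:)
  integer, allocatable :: ip(:)
  complex(dp) :: ajr
  integer :: i, j, pj, ixj
  integer(i8) :: t0, t1, rate

  allocate(a(n, n), d(n), ip(n))
  a = (1.0e-3_dp, 0.0_dp)                   ! synthetic, well-scaled data
  d = (1.0_dp, 0.0_dp)
  do j = 1, n
     ip(j) = j                              ! identity pivots, illustration only
  end do

  call system_clock(t0, rate)
  ixj = 0
  do j = 1, n                               ! same shape as the problem block, R = 1
     ixj = ixj + 1
     pj = ip(j)
     ajr = d(pj)
     a(j, 1) = ajr
     d(pj) = d(j)
     do i = j + 1, n
        d(i) = d(i) - a(i, ixj)*ajr
     end do
  end do
  call system_clock(t1)

  print '(a,f10.6,a)', 'elapsed: ', real(t1 - t0, dp)/real(rate, dp), ' s'
  print *, 'd(n) = ', d(n)                  ! use the result so the loop is not optimized away
end program time_block
```

If the timings match across the two settings on the same synthetic data, the slowdown in the real program more likely comes from what the loop is fed than from how it was compiled.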


Steven_L_Intel1
Employee
No matter what, I would not expect a 10X slowdown just for this. Something else is going on.

groupw_bench
Beginner
I sure didn't expect it either. But all I have to do to solve the problem is to change the Omit Frame Pointers option setting.

So where should I begin looking for the "something else"?

Steven_L_Intel1
Employee
At this point I would ask that you send an example program and description of the problem to Intel Premier Support. We would very much like to understand what is going on.

drgfthomas
Beginner

Frame Pointers can be set in IVF and VC++ and they need not be the same. Indeed, my IVF 9.1 on .NET 2003 defaults to NO and YES, respectively. As the documentation cautions, they had better both be set to NO for Traceback to deliver anything meaningful. How do different settings affect EBP usage, if at all?

jimdempseyatthecove
Honored Contributor III

The mov would not be significant.

If the tests were run with the same assembly code, i.e. you captured the disassembly window and ran the test in the debugger for both configurations, and the run times are still that different, then there could be an alignment issue. (Eliminate the possibility that what you see in the debugger is NOT what ran in the tests.)

If A(NROW, *) and D(LD2) are not allocatable then try using the

cDEC$ ATTRIBUTES ALIGN:16 :: A

in the appropriate module that declares the variable, substituting the actual array name for the dummy argument name.

Also, in XX1, force alignment on AJR

cDEC$ ATTRIBUTES ALIGN:16 :: AJR
COMPLEX (KIND = 8) :: AJR

And, prior to doing this, you can use the debugger Memory window to convert the symbolic name to a hex address. Examine where A(1,1) and D(1) are located.

An alignment with the hex address ending in 0 for A(1,1) and D(1) would yield the best performance. Your loop is small. If full optimizations are on then AJR is likely registerized (it is complex and would occupy 2 registers).


The alignment issue alone would not account for a 10x difference in performance. What could, though, is cache collisions.

Therefore it is likely that D(I) and A(I,IXJ) are unfortunately aligned in a manner that causes cache collisions in one of the configurations.

For diagnostics you can obtain the locations of the first elements used in the inner DO loop

write(*,*) LOC(D(JP1)), LOC(A(JP1,IXJ))
DO I = JP1, NROW

Check out what you get with and without the Frame Pointer
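
As a standalone illustration of what to look for in those addresses, here is a minimal sketch that prints each array's offset within a 16-byte boundary and the distance between the two arrays modulo 4096. It uses the standard c_loc from iso_c_binding in place of the LOC extension; the program name, the array sizes, and the 4096-byte figure (a stand-in for a cache way size) are assumptions for illustration, not values taken from this thread.

```fortran
program check_alignment
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nrow = 1000         ! hypothetical size
  complex(dp), allocatable, target :: a(:,:), d(:)
  integer(c_intptr_t) :: addr_a, addr_d

  allocate(a(nrow, nrow), d(nrow))

  ! Convert the C addresses to integers so they can be inspected.
  addr_a = transfer(c_loc(a(1, 1)), addr_a)
  addr_d = transfer(c_loc(d(1)), addr_d)

  ! An offset of 0 means the first element sits on a 16-byte boundary.
  print '(a,z0,a,i0)', 'A(1,1) at 0x', addr_a, &
        '  mod 16 = ', modulo(addr_a, 16_c_intptr_t)
  print '(a,z0,a,i0)', 'D(1)   at 0x', addr_d, &
        '  mod 16 = ', modulo(addr_d, 16_c_intptr_t)

  ! Assumed 4096-byte stride: a result near 0 suggests the two streams
  ! may compete for the same cache sets.
  print '(a,i0)', '|A - D| mod 4096 = ', &
        modulo(abs(addr_a - addr_d), 4096_c_intptr_t)
end program check_alignment
```

Running the same check in both builds shows whether the option change actually moved the data; if the numbers are identical, alignment is unlikely to explain the difference.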

Good luck hunting

Jim Dempsey

groupw_bench
Beginner
Steve was right - there was something else going on. The reason the code section was slow was because of the data it was processing. When I had Omit Frame Pointers = Yes, the routine was operating on unusual data because of failure to correctly read file data. I found out where the real problem is, but don't know why. It's in calling the Windows API function ReadFile.

I haven't run any tests with other than optimization for maximum speed (basically, the default Release settings). Under those conditions, the problem occurs only when:

-- I've compiled with the default real size of 8 bytes, AND
-- No debug information is being created, AND
-- Omit Frame Pointers = Yes

It's completely repeatable.

Because the problem disappears when I specify that debug information be written, I've been able to troubleshoot it only by having the program write diagnostic information to a file.

The problem is occurring at the following Windows API call:

      lngResult = ReadFile(hFileHandle,lngLoc1,lngBytesToProc,
     &                     lngLocBytesProc,NULL_OVERLAPPED)

lngResult, hFileHandle, lngLoc1, lngBytesToProc, and lngLocBytesProc are all INTEGER(KIND = 4). lngLoc1 = LOC(A(1)) where A is an ALLOCATABLE array of type COMPLEX(KIND = 8) (when compiled for default 8 byte real type) of dimension about 1,000,000, and lngLocBytesProc = LOC(lngBytesProc) where lngBytesProc is an INTEGER (KIND = 4) variable.

The file being read is about 200 MB in size. lngBytesToProc is 8477344. (The A array is filled with several reads from different parts of the file.)

In normal operation, the function call returns a value of 1 (success) for lngResult, and variable lngBytesProc contains 8477344, indicating that the requested number of bytes have been read. The file pointer is at 0 before the call and at 8477344 afterward. The values in the first 529834 (8477344/16) fields of A are those from the file.

When the problem conditions exist, the function call still returns 1 (success), but lngBytesProc contains zero, indicating that no bytes were in fact read. Array A is unchanged by the read operation, confirming that no bytes were read, or at least not put into the array. And the file pointer is still at zero after the function call. I've confirmed that the values being sent to the function call are exactly the same when the problem does and doesn't occur, and that the content of the file being read is identical.

I can't find anything wrong with my code which, incidentally, has been working fine for several years compiled with CVF 6.6. So it looks to me like a bug in the IVF 9.1 compiler. For now, I'm going to just leave Omit Frame Pointers = No for all my IVF compilations.
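
Because ReadFile can report success while transferring fewer bytes than requested, the return value alone does not show that the buffer was filled. The following is a minimal sketch of a guard that tests both values before the data is used. The module and function names are hypothetical, and the demo passes the numbers quoted above as plain integers instead of calling the Windows API.

```fortran
module read_checks
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! True only if the call reported success AND moved every requested byte.
  logical function read_complete(ok_flag, bytes_requested, bytes_read)
    integer, intent(in) :: ok_flag, bytes_requested, bytes_read
    read_complete = (ok_flag /= 0) .and. (bytes_read == bytes_requested)
  end function read_complete

  ! Number of COMPLEX(dp) elements covered by a byte count.
  integer function elements_in(nbytes)
    integer, intent(in) :: nbytes
    complex(dp) :: z
    elements_in = nbytes / (storage_size(z)/8)
  end function elements_in
end module read_checks

program demo
  use read_checks
  implicit none
  ! Normal case described above: success flag set, all bytes read.
  print *, read_complete(1, 8477344, 8477344)
  ! Failing case described above: success flag set, zero bytes read.
  print *, read_complete(1, 8477344, 0)
  ! 8477344 bytes at 16 bytes per element is 529834 elements.
  print *, elements_in(8477344)
end program demo
```

Stopping with a diagnostic as soon as such a check fails would keep stale array contents from reaching later computation.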

Jugoslav_Dujic
Valued Contributor II
What does GetLastError say after ReadFile fails? I'd venture to guess that it would return some kind of memory problem. To me, it doesn't look like an IVF bug, but rather a consequence of a different memory layout, settings, RTL defaults or something along those lines.

groupw_bench
Beginner
Intrinsic function ERRSNS returns zero for all five parameters when called immediately after ReadFile, indicating that no error was detected. According to the IVF documentation, the second parameter sys_error should be the value returned by GetLastError. The zero (ERROR_SUCCESS) sys_error value is consistent with the value of 1 (true = success) returned by ReadFile. This set of symptoms could occur if, for example, the value of lngBytesToProc (third parameter of ReadFile) isn't getting sent to the function as it should. I verified that the variable lngBytesToProc does contain the correct value in my code immediately before the ReadFile call.

I can repeatably cause failure of the ReadFile call by changing nothing but Omit Frame Pointers; all other compiler options are the same for successful and failure cases.


Steven_L_Intel1
Employee
Please send a test case to Intel Premier Support and ask that it be assigned to Steve Lionel. I'd like to look at it.

groupw_bench
Beginner
I'll do that, but it might take a while. I just spent a few hours building a test program which opens and reads the same file as the real program using exactly the same methods and simplified versions of the same procedures. But of course the problem doesn't occur. The actual program is a few tens of thousands of lines of code, a lot of it old spaghetti Fortran-77 (or maybe even Fortran IV). And it's intended to be run by an external program which sets up a file structure that the Fortran program expects, and communicates via input and output files. It'll probably be necessary to send the whole program, so I'll have to set things up so you can run it independently and hopefully make some sense out of what's happening.

jimdempseyatthecove
Honored Contributor III

What happens when you compile everything with Omit Frame Pointers except for the modules that perform the ReadFile? Compiling this way will likely fix the problem and still provide the performance "tweak" of having an extra register available. If this does not fix the problem, then the problem is deeper than ReadFile and the symptoms just coincidentally happened to show up in ReadFile.

Jim Dempsey
