Strange optimizer effect

groupw_bench · 02-16-2012

I hope somebody can help explain this. It took several days of hard work to track it down.

I'm compiling some old code that had been working fine with earlier compilers, but I ran into trouble compiling it for the x64 platform. While trying to find that problem I discovered that it produces incorrect results when compiled for Win32 if any optimization is used. After a lot of trial and error I also discovered that the problem goes away (even when optimizing at the /O2 level) if I specify that a run-time array bounds check be done.

Because it happens only with optimization, I couldn't use the debugger, so I began tracking it by having the program write values to an output file. I was zeroing in on a particular loop by comparing intermediate results in the "bad" (compiled with /O2 and giving wrong answers) and "good" (compiled with /O2 but with run-time array bounds checking and giving correct answers) programs. But I discovered that just putting a write statement in the middle of the following loop fixed the "bad" program. Not too surprisingly, a NOVECTOR directive for that loop also cured the symptom I'd been seeing. But I suspect there are similar problems lurking elsewhere in the program, and I need to understand this so I can track them down.

A symptom that I tracked down was that the sign of Y(3) was negative when it should have been positive. The absolute value was correct. There might easily be other errors, but that's one that I was able to positively identify. XS1, YS1, and ZS1 are simply assigned the values of variables sent in as parameters, prior to the DO loop.

DO I=IST,N
ITAG(I)=ITG
XS2=XS1+XD*DELZ
YS2=YS1+YD*DELZ
ZS2=ZS1+ZD*DELZ
X(I)=XS1
Y(I)=YS1
Z(I)=ZS1
X2(I)=XS2
Y2(I)=YS2
Z2(I)=ZS2
BI(I)=RADZ
reSegWireIns(1, I) = WInsEpsr
reSegWireIns(2, I) = WInsThk
reSegWireIns(3, I) = WInsLossTan
DELZ=DELZ*RD
RADZ=RADZ*RRAD
XS1=XS2
YS1=YS2
ZS1=ZS2
END DO
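As a minimal sketch of the NOVECTOR workaround mentioned above, the directive goes on the line immediately before the DO statement. The declarations, initial values, and the reduced loop body here are assumptions for illustration (the thread does not show them). The `!DIR$` spelling is the Intel-specific form; compilers that do not recognize it treat the line as a comment.

```fortran
! Sketch only: a reduced version of the loop with vectorization disabled.
! All declarations and initial values are assumed, not taken from the
! original program.
program novector_demo
  implicit none
  integer, parameter :: n = 5, ist = 1
  real(kind=4) :: x(n), x2(n), bi(n)
  real(kind=8) :: xs1, xs2, xd, delz, rd, radz, rrad
  integer :: i

  xs1 = 0.0d0
  xd = 1.0d0
  delz = 0.5d0
  rd = 1.1d0
  radz = 0.01d0
  rrad = 1.05d0

  ! The directive applies to the DO loop that immediately follows it.
  !DIR$ NOVECTOR
  do i = ist, n
     xs2 = xs1 + xd*delz
     x(i) = real(xs1, kind=4)
     x2(i) = real(xs2, kind=4)
     bi(i) = real(radz, kind=4)
     delz = delz*rd
     radz = radz*rrad
     xs1 = xs2
  end do

  print *, 'X  =', x
  print *, 'X2 =', x2
  print *, 'BI =', bi
end program novector_demo
```

Because the directive only suppresses vectorization of one loop, it removes the symptom without explaining it, which matches the concern that other loops could be affected.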

My program compiled for the x64 platform is showing what might be the same problem, although it's happening without optimization so it may be different. In any case, I'd like to find any other loops which might behave the same way -- it's nearly impossible to test every one of them in the complex program. If there's something fundamentally wrong with the code it might cause trouble even without optimization and needs to be fixed.

There are a few unusual things about the loop. Arrays X, Y, Z, X2, Y2, Z2, and BI are REAL (KIND = 4) and dimensioned to an integer value. All, including the dimension, are passed in as parameters. This is true also for the calling routine. In the procedure which calls it, however, the arrays are allocatable (although still the same type and kind, and allocated to the same dimension). I have checked, and the limits of I never exceed the array bound -- unless the optimizer causes them to. Putting a write statement in the loop in an attempt to find out makes the loop function correctly so I can't tell what the optimizer is doing.

Several of the other variables are REAL (KIND = 8): XS1, YS1, ZS1, XS2, YS2, ZS2 for example. While I don't see how this should hurt anything, it might have some effect on the optimizer.

Here's the relevant part of the level 1 optimizer output without the NOVECTOR directive, when it's creating the defective program. The problem loop is in WIRE which is called by DATAGN which is called by the main program. CONECT is another subroutine called by DATAGN after the call to WIRE. The problem loop DO statement is at line 657 of WIRE.

1>file1.for
1><;-1:-1;IPO;;0>
1>WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
1>WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
1>WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false
1>INLINING OPTION VALUES:
1> -Qinline-factor: 100
1> -Qinline-min-size: 20
1> -Qinline-max-size: 230
1> -Qinline-max-total-size: 2000
1> -Qinline-max-per-routine: 10000
1> -Qinline-max-per-compile: 125000
1>
1><>
1>INLINING REPORT: (_DATAGN) [1/3=33.3%]
1> -> _asin(EXTERN)
1> -> _ATGN2(EXTERN)
1> -> ARGS_IN_REGS: _CONECT.(2) (isz = 568) (sz = 595 (311+284))
1> -> _for_cpstr(EXTERN)
1> -> ARGS_IN_REGS: _WIRE.(1) (isz = 131) (sz = 178 (60+118))
1>
1>HPO VECTORIZER REPORT (_DATAGN) LOG OPENED ON Thu Feb 16 00:38:10 2012
1>
1><>
1>HPO Vectorizer Report (_DATAGN)
1>
1><>
1>INLINING REPORT: (_WIRE) [2/3=66.7%]
1>
1>HPO VECTORIZER REPORT (_WIRE) LOG OPENED ON Thu Feb 16 00:38:10 2012
1>
1><>
1>HPO Vectorizer Report (_WIRE)
1>D:\Documents\Visual Studio Net\Projects\TestDir\file1.for(658:7-658:7):VEC:_WIRE: PARTIAL LOOP WAS VECTORIZED
1>PARTIAL LOOP WAS VECTORIZED
1>HLO REPORT LOG OPENED ON Thu Feb 16 00:38:10 2012
1>
1><>
1>High Level Optimizer Report (_WIRE)
1><>
1>LOOP DISTRIBUTION in _WIRE at line 658
1>LOOP DISTRIBUTION in _WIRE at line 658
1><>
1>INLINING REPORT: (_CONECT) [3/3=100.0%]
1>
1>HPO VECTORIZER REPORT (_CONECT) LOG OPENED ON Thu Feb 16 00:38:10 2012
1>
1><>
1>HPO Vectorizer Report (_CONECT)
1>
1><>
1>High Level Optimizer Report (_CONECT)
1><>
1> STATIC: D:\Documents\Visual Studio Net\Projects\TestDir\file1.for _WIRE.
1><>
1> STATIC: D:\Documents\Visual Studio Net\Projects\TestDir\file1.for _CONECT.
1><>
1> STATIC: D:\Documents\Visual Studio Net\Projects\TestDir\file1.for _DATAGN
1><>
1> STATIC: D:\Documents\Visual Studio Net\Projects\TestDir\file1.for _CONECT
1><>
1> STATIC: D:\Documents\Visual Studio Net\Projects\TestDir\file1.for _WIRE
1><;-1:-1;PGO;;0>
1> 5 FUNCTIONS HAD VALID STATIC PROFILES
1> IPO CURRENT QUALITY METRIC: 50.0%
1> IPO POSSIBLE QUALITY METRIC: 50.0%
1> IPO QUALITY METRIC RATIO: 100.0%

I'm very inexperienced at Fortran programming and new to the latest IVF compiler I'm using, and I can't get much meaningful information from this report except that it's definitely doing something to the loop. So any help with understanding this is greatly appreciated!

mecej4 · 02-16-2012

Please provide a complete example source code that exhibits the error, and make it as short as possible. State which compiler version you use, the compiler options used, and identify the OS.

There is rather too much diagnosis and too little description of symptoms here.

TimP · 02-16-2012

In the source code you quote, you have apparent (but false) loop carried recursion.
It would be safer to write
DELZ = (I-1) * RD + ?
RADZ = (I-1) * RRAD + ?
if that is what is meant, rather than ask the compiler to work around the recursion.
Likewise, write explicitly what XS1,YS1,ZS1 are for each I rather than asking the compiler to unravel recursion and find the initializations which you don't show.
It's impossible to guess whether the compiler would split out (distribute) the easily vectorizable assignments or peel off the first iteration in an attempt to optimize obscurely written code.

groupw_bench · 02-16-2012

Quoting TimP (Intel)

In the source code you quote, you have apparent (but false) loop carried recursion.
It would be safer to write
DELZ = (I-1) * RD + ?
RADZ = (I-1) * RRAD + ?
if that is what is meant, rather than ask the compiler to work around the recursion.
Likewise, write explicitly what XS1,YS1,ZS1 are for each I rather than asking the compiler to unravel recursion and find the initializations which you don't show.
It's impossible to guess whether the compiler would split out (distribute) the easily vectorizable assignments or peel off the first iteration in an attempt to optimize obscurely written code.

I'm sorry, my level of understanding is too low to follow what you're suggesting. But it looks to me like direct calculation of DELZ, for example, would require DELZ*(RD**I), not (I-1) * RD as in your example. What am I missing there?

XS1, YS1, and ZS1 are initialized with appropriate values prior to the DO loop.

The reason the code was written as it was, is that while X, Y, and Z are KIND=4, most of the other variables such as X/Y/ZS1 and X/Y/ZS2 are KIND=8. I believe this was done to reduce accumulating error. And mixed variable types seem to be involved in the problem.
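A small sketch of the point about direct calculation: the update DELZ=DELZ*RD is a geometric recurrence, so the value of DELZ used in iteration I is the entry value times RD**(I-IST), not a linear function of I. The names `delz0`, `delz_rec`, and `delz_direct` and all numeric values below are made up for illustration.

```fortran
! Sketch only: compares the running product DELZ=DELZ*RD against the
! closed form DELZ0*RD**(I-IST). Values are arbitrary test inputs.
program delz_closed_form
  implicit none
  integer, parameter :: ist = 3, n = 12
  real(kind=8), parameter :: delz0 = 0.25d0, rd = 1.07d0
  real(kind=8) :: delz_rec, delz_direct, relerr
  integer :: i

  delz_rec = delz0
  do i = ist, n
     ! Value of DELZ as seen inside iteration i, computed directly.
     delz_direct = delz0 * rd**(i - ist)
     relerr = abs(delz_rec - delz_direct) / delz_direct
     if (relerr > 1.0d-12) error stop 'recurrence and closed form disagree'
     ! Same update as at the bottom of the original loop.
     delz_rec = delz_rec * rd
  end do

  print *, 'recurrence and closed form agree to within 1e-12 (relative)'
end program delz_closed_form
```

Writing each iteration's value directly removes the dependence of one iteration on the previous one, which is the property a vectorizer otherwise has to work around.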

groupw_bench · 02-16-2012

Quoting mecej4

Please provide a complete example source code that exhibits the error, and make it as short as possible. State which compiler version you use, the compiler options used, and identify the OS.

There is rather too much diagnosis and too little description of symptoms here.

Thanks, I spent most of the day trying to do just that. I replicated the problem loop and its subroutine, variable kinds, dimensions, and assignments, and so forth as well as I could, but the simplified version produces correct results. Exactly the same values are being sent into the subroutine as in the full program. So apparently the optimizer is looking farther back than I replicated.

I did, however, run across a simple way to make the problem go away. I moved

XS1=XS2
YS1=YS2
ZS1=ZS2

from the bottom of the DO loop to the top, and initialized X/Y/ZS2 instead of X/Y/ZS1 before the loop. That makes the optimizer happy. I have no clue why.
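The two orderings described here can be sketched side by side. This reduced version keeps only the X arrays, and all declarations and input values are assumptions for illustration. Mathematically the two loops perform the same operations in the same order on the same values, so any difference seen in the full program would come from how the compiler transforms them rather than from the source semantics.

```fortran
! Sketch only: original ordering versus the reordered loop described in
! the post. Inputs are arbitrary; only the X/X2 arrays are kept.
program reorder_demo
  implicit none
  integer, parameter :: ist = 1, n = 8
  real(kind=8), parameter :: x0 = -1.5d0, xd = 2.0d0
  real(kind=8), parameter :: delz0 = 0.1d0, rd = 1.2d0
  real(kind=4) :: xa(n), x2a(n), xb(n), x2b(n)
  real(kind=8) :: xs1, xs2, delz
  integer :: i

  ! Original ordering: XS1 is initialized before the loop and
  ! XS1=XS2 is the last statement of the loop body.
  xs1 = x0
  delz = delz0
  do i = ist, n
     xs2 = xs1 + xd*delz
     xa(i) = real(xs1, kind=4)
     x2a(i) = real(xs2, kind=4)
     delz = delz*rd
     xs1 = xs2
  end do

  ! Reordered: XS2 is initialized before the loop and
  ! XS1=XS2 is the first statement of the loop body.
  xs2 = x0
  delz = delz0
  do i = ist, n
     xs1 = xs2
     xs2 = xs1 + xd*delz
     xb(i) = real(xs1, kind=4)
     x2b(i) = real(xs2, kind=4)
     delz = delz*rd
  end do

  print *, 'max |X  difference| =', maxval(abs(xa - xb))
  print *, 'max |X2 difference| =', maxval(abs(x2a - x2b))
end program reorder_demo
```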

I've been over the code with a fine-toothed comb and can't find anything defective about it. The simplified program also works fine with exactly the same loop. So apparently there are just some loops that the optimizer scrambles and some it doesn't. I'll have to avoid using any optimization, since it would be impossible to exercise every loop in the program to find any others which have been trashed.

Oh, regarding your question about my system. I'm using the Composer XE 2011 SP1 in the Visual Studio 2010 environment running under 64 bit Windows 7. The problem program is directed to a Win32 platform -- so far I haven't been able to get the x64 compiled version to do anything but start and immediately stop without any errors reported. Here's one set of compiler and linker command lines that cause the optimization problem:

/nologo /Oy- /DLs=1 /DLi=1 /DLe=1 /DPlatform=0 /fpscomp:nolibs /warn:interfaces /fpconstant /module:"TestPgm Release/" /object:"TestPgm Release/" /Fd"TestPgm Release\vc100.pdb" /libs:static /threads /c

/OUT:"TestPgm Release/TestPgm.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"D:\Documents\Visual Studio Net\Projects\TestPgm\TestPgm Release\TestPgm.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /IMPLIB:"D:\Documents\Visual Studio Net\Projects\TestPgm\TestPgm Release\TestPgm.lib"

I haven't counted them up, but I'd guess that the program has around 100,000 lines of code. So I'm sorry that the simplified one doesn't exhibit the problem.

Thanks!

mecej4 · 02-17-2012

Often, the only way to pin down optimizer bugs is to start with the full program affected by the problem, and whittle away pieces. Each time after dropping lines of code or commenting out subprogram calls, one has to rebuild and run to see if the precious bug is preserved.

There are some guidelines that help to reduce the effort, particularly when the bug does not require compiling with /Qipo to exhibit itself. Most of the code that will be executed after the suspected part can be removed at the outset. A shorter program can be constructed by collecting the argument values at entry to the subprogram with the bug, and writing a shorter program that merely calls the problematic subroutine directly and with the same arguments.
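The idea of calling the suspect subroutine directly with captured arguments might look like the sketch below. Everything here is hypothetical: `wire_stub` stands in for the real WIRE, whose argument list the thread does not show, and the numeric arguments are placeholders for values logged at entry in the full program. The actual arrays are allocatable in the driver and explicit-shape in the subroutine, mirroring the arrangement described earlier in the thread.

```fortran
! Sketch only. wire_stub is a hypothetical stand-in for the suspect
! routine; its argument list and body are assumptions.
subroutine wire_stub(x, x2, ld, ist, n, xw1, xd, delz0, rd)
  implicit none
  integer, intent(in) :: ld, ist, n
  real(kind=4), intent(inout) :: x(ld), x2(ld)
  real(kind=8), intent(in) :: xw1, xd, delz0, rd
  real(kind=8) :: xs1, xs2, delz
  integer :: i

  xs1 = xw1
  delz = delz0
  do i = ist, n
     xs2 = xs1 + xd*delz
     x(i) = real(xs1, kind=4)
     x2(i) = real(xs2, kind=4)
     delz = delz*rd
     xs1 = xs2
  end do
end subroutine wire_stub

program driver
  implicit none
  integer, parameter :: ld = 10
  real(kind=4), allocatable :: x(:), x2(:)

  allocate(x(ld), x2(ld))
  x = 0.0
  x2 = 0.0

  ! Placeholder arguments; replace with the values captured at entry
  ! to the real subroutine.
  call wire_stub(x, x2, ld, 1, ld, -1.5d0, 2.0d0, 0.1d0, 1.2d0)

  print '(a, 10es12.4)', ' X  =', x
  print '(a, 10es12.4)', ' X2 =', x2
end program driver
```

If a driver like this reproduces the wrong results at /O2, the rest of the program can be discarded from the test case. If it does not, as happened with the simplified version described above, the whittling-down approach on the full program is the remaining option.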

A code profiler can be used to detect and remove entire subprograms that are not called prior to the problematic segment.

In the first few iterations of the procedure that I outlined above, it feels as if little progress is being made. However, as the localization of the bug improves, the rate at which lines of code can be removed increases dramatically.

The following is equivalent to the code excerpt that you provided above, and the optimizer can do similar reductions (in a semantic sense; actual reductions may be done in some internal intermediate representation/language).

[fxfortran]
ITAG(IST:N)=ITG
reSegWireIns(1, IST:N) = WInsEpsr
reSegWireIns(2, IST:N) = WInsThk
reSegWireIns(3, IST:N) = WInsLossTan
DO I=IST,N
   X(I)=XS1
   Y(I)=YS1
   Z(I)=ZS1
   XS1=XS1+XD*DELZ
   YS1=YS1+YD*DELZ
   ZS1=ZS1+ZD*DELZ
   BI(I)=RADZ
   DELZ=DELZ*RD
   RADZ=RADZ*RRAD
END DO
X2(IST:N-1)=X(IST+1:N)
Y2(IST:N-1)=Y(IST+1:N)
Z2(IST:N-1)=Z(IST+1:N)
X2(N)=XS1
Y2(N)=YS1
Z2(N)=ZS1
[/fxfortran]

What TimP stated in his response was that the repeated multiplication of the type in the expressions for XS1, ... can cause undesirable accumulation of error, especially if some of the variables are single-precision reals.
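A rough illustration of how accumulated error depends on precision, assuming nothing about the original program's data: the same running sum is carried in a single-precision and a double-precision variable and compared with a directly computed product. The step size and iteration count are arbitrary.

```fortran
! Sketch only: running sums in KIND=4 and KIND=8 versus a direct product.
! The step value and number of steps are arbitrary choices.
program accumulation_demo
  implicit none
  integer, parameter :: nsteps = 100000
  real(kind=4) :: s4, step4
  real(kind=8) :: s8, step8
  integer :: i

  step4 = 0.1
  step8 = 0.1d0
  s4 = 0.0
  s8 = 0.0d0

  do i = 1, nsteps
     s4 = s4 + step4   ! rounding error accumulates at single precision
     s8 = s8 + step8   ! same recurrence carried in double precision
  end do

  print *, 'single-precision running sum:', s4
  print *, 'double-precision running sum:', s8
  print *, 'direct product nsteps*0.1d0 :', nsteps*0.1d0
end program accumulation_demo
```

Carrying the running values in KIND=8 and converting to KIND=4 only on assignment to the arrays, as the original loop does, keeps the accumulated error small relative to single-precision resolution.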