Optimization problems at level /O2

emonette123 · ‎09-27-2011

Here is a code snippet I compiled with Intel Fortran 12.0.4.196. The banner is:

Intel Visual Fortran Intel 64 Compiler XE for applications running on Intel 64, Version 12.0.4.196 Build 20110427
Copyright (C) 1985-2011 Intel Corporation. All rights reserved.

and the compilation line is:

ifort /nologo /f77rtl /Qsave /Qzero /O2 /assume:nodummy_aliases /MT

The code snippet is (good old F77...):

      IF( MODE .EQ. 2 )THEN
C        Face on element
         DO 100 L=1,3
            A(L) = RFNODE(L,IFELEM(5,FACE))
            B(L) = RFNODE(L,IFELEM(6,FACE))
            C(L) = RFNODE(L,IFELEM(7,FACE))
            IF( IFELEM(2,FACE) .EQ. 115 .OR.
     +          IFELEM(2,FACE) .EQ. 116 .OR.
     +          IFELEM(2,FACE) .EQ. 119 .OR.
     +          IFELEM(2,FACE) .EQ. 120 )THEN
               POINT(L) = RFNODE(L,IFELEM(9,FACE))
            ELSE
               POINT(L) = RFNODE(L,IFELEM(8,FACE))
            ENDIF
  100    CONTINUE
      ELSE
         INDX = 9
         IF( IFACE(9,FACE).EQ.0 ) INDX = 8
         DO 110 L=1,3
            A(L) = RFNODE( L, IFACE(6,FACE) )
            B(L) = RFNODE( L, IFACE(7,FACE) )
            C(L) = RFNODE( L,IFACE(INDX,FACE) )
  110    CONTINUE
      ENDIF

In the case above, MODE is 2; MODE and FACE are received as arguments. Obviously, the IF in loop 100 could be moved out of the loop. Alas, this code crashes when compiled with optimization. Here is the generated assembly, which I hope is around the test for MODE .EQ. 2 (the jne instruction comes just after a comparison with 2):

000000013F6C1FE8 jne 000000013F6C21D7
000000013F6C1FEE movdqa xmm3,xmmword ptr [13FA47E30h]
000000013F6C1FF6 lea r10,[rax*8]
000000013F6C1FFE sub r10,rax
000000013F6C2001 shl r10,4
000000013F6C2005 movdqa xmm2,xmmword ptr [13FA47E40h]
000000013F6C200D movsxd rdx,dword ptr [r10+r9-58h]
000000013F6C2012 imul rdx,rdx,9Ch
000000013F6C2019 movsxd rcx,dword ptr [r10+r9-5Ch]
000000013F6C201E imul rcx,rcx,9Ch
000000013F6C2025 cvtps2pd xmm4,mmword ptr [rdx+r14-9Ch]
000000013F6C202E cvtps2pd xmm15,mmword ptr [rcx+r14-9Ch]
000000013F6C2037 movsxd rbx,dword ptr [r10+r9-60h]
000000013F6C203C imul rbx,rbx,9Ch
000000013F6C2043 movsxd rax,dword ptr [r10+r9-50h]
000000013F6C2048 imul rax,rax,9Ch
000000013F6C204F cvtps2pd xmm1,mmword ptr [rbx+r14-9Ch]
000000013F6C2058 mov r8d,dword ptr [r10+r9-6Ch]
000000013F6C205D movaps xmmword ptr [13FB29530h],xmm4
000000013F6C2064 movdqa xmm4,xmmword ptr [13FA47E20h]
000000013F6C206C movd xmm0,r8d
000000013F6C2071 pshufd xmm5,xmm0,0
000000013F6C2076 movdqa xmm0,xmmword ptr [13FA47E50h]
000000013F6C207E pcmpeqd xmm4,xmm5
000000013F6C2082 pcmpeqd xmm3,xmm5
000000013F6C2086 pcmpeqd xmm2,xmm5
000000013F6C208A pcmpeqd xmm5,xmm0
000000013F6C208E movdqa xmm0,xmm4
000000013F6C2092 movaps xmmword ptr [13FB29510h],xmm15
000000013F6C209A movdqa xmm15,xmm3
000000013F6C209F punpckldq xmm0,xmm4
000000013F6C20A3 orps xmm4,xmm3
000000013F6C20A6 movaps xmmword ptr [13FB294F0h],xmm1
000000013F6C20AD orps xmm4,xmm2
000000013F6C20B0 punpckldq xmm15,xmm3
000000013F6C20B5 orps xmm4,xmm5
000000013F6C20B8 cvtps2pd xmm1,mmword ptr [rax+r14-9Ch]

It crashes at the last line with an access violation: rax is zero, and it should not be.

If I change the code to this:

      IF( MODE .EQ. 2 )THEN
C        Face on element
         IF( IFELEM(2,FACE) .EQ. 115 .OR.
     +       IFELEM(2,FACE) .EQ. 116 .OR.
     +       IFELEM(2,FACE) .EQ. 119 .OR.
     +       IFELEM(2,FACE) .EQ. 120 )THEN
            INDX = 9
         ELSE
            INDX = 8
         ENDIF
         DO 100 L=1,3
            A(L) = RFNODE(L,IFELEM(5,FACE))
            B(L) = RFNODE(L,IFELEM(6,FACE))
            C(L) = RFNODE(L,IFELEM(7,FACE))
            POINT(L) = RFNODE(L,IFELEM(INDX,FACE))
  100    CONTINUE
      ELSE
         INDX = 9
         IF( IFACE(9,FACE).EQ.0 ) INDX = 8
         DO 110 L=1,3
            A(L) = RFNODE( L, IFACE(6,FACE) )
            B(L) = RFNODE( L, IFACE(7,FACE) )
            C(L) = RFNODE( L,IFACE(INDX,FACE) )
  110    CONTINUE
      ENDIF

Everything works fine.

I am really concerned about this, since there are many places in our code where this optimization can take place, and we have millions of lines of code. And (obviously) this only happens in optimized mode, so it is a pain to debug. Should we upgrade to a more recent version of the compiler? Has this problem been addressed? Is there any more info I can provide?

This code used to compile OK in version 9.1 with /optimize:2. It also compiles OK at optimization level /O1, which is equivalent to /optimize:2 according to the deprecation help of version 12.0. So for now, I will stick to /O1.

Thanks for any hint,

Etienne Monette

Steven_L_Intel1 · ‎09-27-2011

Please attach a complete, compilable source that demonstrates the problem. We cannot investigate based on a snippet.

emonette123 · ‎09-27-2011

Ok, here we go. The zip file contains everything needed to compile. To compile:

ifort /nologo /f77rtl /Qsave /Qzero /O2 /assume:nodummy_aliases /MT /c /Fosplit9.obj split9.f
link /nologo /MANIFEST /OUT:optbug.exe z8d.lib tmg.lib apcfg.lib CharFortranC.lib wizintf.lib z8d.lib esc.lib nx2tmg.lib emc.lib octree2.lib xmlwrapper.lib expat.lib version.lib MayaSecurityValidator.lib tomcrypt.lib nx2tmg_main.obj split9.obj

Execute this line to obtain the crash:

optbug.exe -s NXTMG68c-Solution_1.xml

Sorry about the many libraries; I tried to isolate the code, but then it did not crash. The code should execute until you see the line about performing the mesh check, and then crash.

By the way, this is on 64-bit Windows 7.

Etienne

mecej4 · ‎09-27-2011

Sorry, no crash! Last lines of output:

[bash] Sub-domain    Velocity         Length          Reynolds     Mach
 ----------------------------------------------------------------------
   1- AIR      0.0000E+00 mm/s  9.9477E+01 mm   0.0000E+00   0.0000E+00
 ----------------------------------------------------------------------

 Writing flow model files...
 ...done.

 Writing thermal model files...
 ...done.
_#I NX2TMG 2 87E294278CCC534A9F621143DC626BC025A5BC6D 7B457DABB47D9AEC1546B776292BD977B10E08CE$
_#I NX2TMG 2 E3683141BBBB2FBEFC84170AF3A438FFBE0FBE43 924B20CFEF0E13ACC505D521B3D3FDCBA5D2CE74$
_#I NX2TMG 2 45C7F4A71927CA734757638CC2B7886BEECF7AA6 1214CF71FDCFDFC3C174FE8E441866751F94D7B2$
_#I NX2TMG 2 BA945355A611941156BEBE98A8F1B7B7E6C317A1 CDAEBA6676B6A23E3B87D716916A8CCC5582AD46$
_#I NX2TMG 2 54C2885CA023CA307A57E8F0FCAB32367382CEAC CD80DE2F684F767874FBAF11C5DD2E9C413F7C5F$
_#I NX2TMG 2 B97329E1436D76025DF26BB4BF9FE3C0FE36851D 2923DBB2CB707BFEA8B61C241262EED30014C6E6$[/bash]

emonette123 · ‎09-28-2011

Indeed, it didn't crash on your computer. So I tried it out around here. It crashes on my computer (Windows 7), and on an XP 64 computer. But it passes on another Windows 7 computer. So we are even at 2 crashes and 2 passes.

Can you try it out on more than one computer? To crash, the program has to access an out-of-bounds memory address, and whether that access faults appears to be machine specific. When it crashes on my computer, it does so on this assembly line:

000000013F6C20B8 cvtps2pd xmm1,mmword ptr [rax+r14-9Ch]

From what I understand (my x64 assembly is not that good!) rax represents an offset in the RFNODE array and so does rbx:

000000013F6C2037 movsxd rbx,dword ptr [r10+r9-60h]
000000013F6C203C imul rbx,rbx,9Ch
000000013F6C2043 movsxd rax,dword ptr [r10+r9-50h]
000000013F6C2048 imul rax,rax,9Ch

When it crashes, rax is 0 and rbx is 156. The latter corresponds to index 1 in the RFNODE array, whose columns are 39*4 = 156 bytes wide.
I added some print statements, giving respectively MODE, IFELEM(2,FACE) and IFELEM(5..9,FACE):

Mode: 2
Type: 111
Node 1: 1
Node 2: 2
Node 3: 3
Node 4: 4
Node 5: 0

As you see, only IFELEM(9,FACE) is zero, but since the type is 111, it is IFELEM(8,FACE) which should be used. The assembly line where it crashes represents dereferencing a Fortran array at index 0 and this does not always crash.
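
The byte-offset arithmetic behind those register values can be sketched as follows. This assumes RFNODE is a column-major array of 4-byte reals with 39 rows (inferred from the 9Ch = 156 multiplier) and that r14 holds its base address. COLOFF is a hypothetical helper for illustration, not part of the original code.

```fortran
      INTEGER FUNCTION COLOFF(ICOL, NROWS, NBYTES)
!     Hypothetical helper: byte offset of element (1,ICOL) from
!     the start of a column-major array that has NROWS rows of
!     NBYTES-byte elements and the default lower bound of 1.
      IMPLICIT NONE
      INTEGER ICOL, NROWS, NBYTES
!     Same arithmetic as the assembly: ICOL*156 - 156 when
!     NROWS = 39 and NBYTES = 4.
      COLOFF = (ICOL - 1) * NROWS * NBYTES
      END
```

Under these assumptions, COLOFF(1, 39, 4) is 0 (the first column starts at the base address), while COLOFF(0, 39, 4) is -156, that is, 156 bytes before the start of the array. That matches the operand [rax+r14-9Ch] with rax = 0.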

It looks to me like the optimizer tried to move the IF statement out of loop 100, but did not do it correctly...

Etienne

Steven_L_Intel1 · ‎09-28-2011

I tried it on Windows 7 x64. No errors with either 12.0.4 (which you used) or 12.1.0 (current version). I even took the EXE you had prebuilt in the ZIP and ran it as-is - no crash.

emonette123 · ‎09-28-2011

I tried the prebuilt executable in the zip file on 3 computers here, and it crashes on 2 out of 3...
As I explained, it dereferences an array at index 0, which means it will only crash if the array sits at the edge of the application's legal memory space.

I already have a solution for this which is to use /O1 optimization level, but I really feel uneasy about this since /O2 is considered the default optimization level.

I really hope this is not a compiler problem, but from what I saw in the assembly and from the tests I did, everything points in that direction. I am aware that 99.9% of bugs are the programmer's. But this happens in such a simple routine that the assembly code is quite easy to follow, and what I see in the assembly is wrong.

I am willing to work out something to enable a remote connection to my computer if this might help.

Etienne

Steven_L_Intel1 · ‎09-28-2011

I enabled bounds checking and it saw no issues. You may have some background program on your computers that is corrupting things.

emonette123 · ‎09-28-2011

I should have thought of this before: I replaced the value 0 with a negative value close to the minimum 32-bit integer. So now it is way out of bounds.

See attached file, which includes the aggressive_optbug.exe and the corresponding nx2tmg.lib files.

This one did crash on the computer that previously did not crash. It should fail on any computer now.

And again, everything is fine with /O1 optimization, the problem only occurs with /O2.

Etienne

mecej4 · ‎09-28-2011

I tried something different. On a W7X64 system, on which your EXE (from post #2) ran to completion without errors, I ran your EXE under the VS2010 debugger. This time, the program crashed as you reported, with %rax = 0. What this suggests is either (i) an uninitialized variable or, less likely, (ii) an optimizer bug.

emonette123 · ‎09-28-2011

With the aggressive version, you will get %rax to be very negative... What I did was set IFELEM(9,1) to -2000000000 just before calling the SPLIT9 sub and set it back to 0 just after. In the case we are interested in, IFELEM(9,1) should not be used, but it is.

Etienne

mecej4 · ‎09-28-2011

Here is a suggestion to probe your suspicions about the optimizer.

The number of arguments to the subroutine is fairly small, and the sizes of the arguments are quite modest. Make up a small test program which sets all the subroutine arguments to what they should be (using initialization statements, or reading from a text file) and calls the subroutine.

The test program can be built without using any of your libraries. This test problem can then be run with various compiler options and tested.

emonette123 · ‎09-28-2011

I already tried that without success, which is why I sent out all the libs. But I will try this out again with the aggressive approach.

emonette123 · ‎09-28-2011

Finally nailed it. Attached are two files, cannot get simpler than this. Here is how I compiled:

ifort /nologo /f77rtl /Qsave /Qzero /O2 /assume:nodummy_aliases /MT /c /Fooptbug.obj optbug.f
ifort /nologo /f77rtl /Qsave /Qzero /O2 /assume:nodummy_aliases /MT /c /Fosplit9.obj split9.f
link /nologo /MANIFEST /OUT:optbug.exe optbug.obj split9.obj

Plain Fortran, plain simple.

Again, no problems with /O1.

Etienne

P.S.

Went a step further:

ifort /nologo /O2 /c /Fooptbug.obj optbug.f
ifort /nologo /O2 /c /Fosplit9.obj split9.f
link /nologo /MANIFEST /OUT:optbug.exe optbug.obj split9.obj

Will crash as soon as optimization level is 2 or 3.

[Also removed dependencies on .inc files in split9.f]

mecej4 · ‎09-28-2011

I am not sure what this cut-down example accomplishes.

It shows that if a subscript used is out of its bounds, the program may or may not crash depending on the optimization level and on the contents of uninitialized memory.

I don't think that this establishes anything regarding optimizer bugs.

I still suspect subscript bounds errors and uninitialized variables in your large application.

emonette123 · ‎09-28-2011

The cut down example accomplishes this:

Since IFELEM(2,1) equals 111, the IF statement in LOOP 100 of split9 is always false.
Then, POINT(L) is always equal to RFNODE(L,IFELEM(8,FACE)).

But because of optimizations, the value RFNODE(L,IFELEM(9,FACE)) is prefetched, and I made it out of bounds on purpose. I cannot guarantee that IFELEM(9,FACE) is a valid index for a Fortran array at all times. Nobody can. Even if it were initialized to zero, which is good practice, this code could fail, as zero is not a valid index for a Fortran array with the default lower bound of 1. And this is exactly what is happening in the bigger application (it crashes on my computer, but not on yours).

The cut down example clearly shows an illegal (or dangerous) optimization.
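
In source terms, the difference between what was written and what the optimized code appears to do can be sketched like this. GUARD and SPECUL are hypothetical names, the arrays are simplified to two rows, the type test is reduced to a LOGICAL argument, and SPECUL is an interpretation of the assembly posted earlier rather than actual compiler output.

```fortran
      SUBROUTINE GUARD(TABLE, USE9, I8, I9, OUT)
!     Source semantics: only the selected column is ever read,
!     so the unselected index may hold anything.
      IMPLICIT NONE
      REAL*8 TABLE(2,*), OUT(2)
      LOGICAL USE9
      INTEGER I8, I9, L
      DO L = 1, 2
         IF (USE9) THEN
            OUT(L) = TABLE(L,I9)
         ELSE
            OUT(L) = TABLE(L,I8)
         ENDIF
      ENDDO
      END

      SUBROUTINE SPECUL(TABLE, USE9, I8, I9, OUT)
!     Assumed shape of the optimized code: read both columns
!     first, then select. This is only safe when I8 and I9 are
!     both valid column indices.
      IMPLICIT NONE
      REAL*8 TABLE(2,*), OUT(2)
      REAL*8 V8, V9
      LOGICAL USE9
      INTEGER I8, I9, L
      DO L = 1, 2
         V8 = TABLE(L,I8)
         V9 = TABLE(L,I9)
         IF (USE9) THEN
            OUT(L) = V9
         ELSE
            OUT(L) = V8
         ENDIF
      ENDDO
      END
```

With valid I8 and I9 the two routines return the same OUT. They differ only when the unselected index is out of range: GUARD never touches that column, while SPECUL reads it anyway.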

If you debug the application, you will see that line 31 of split9.f is never executed. But the optimized code executes part of it. I do not know exactly why, but I suspect it prefetches a value to optimize a branch.

If optimization level /O2 means enabling prefetching of any value in the code, then for sure our code must avoid it at all costs. But the ifort help describes /O2 as the default optimization level, and if this means that both sides of every branch must have fetchable values... well, that is too big a requirement for me. I will stick with level /O1.

Etienne

Steven_L_Intel1 · ‎09-28-2011

Please try Update 6. I can reproduce the access violation with Update 4 but not with Update 6.

emonette123 · ‎09-28-2011

I will. In the meantime, I have reduced the problem to its simplest expression; see the attached file.
I compiled it on Linux with the Portland Group compiler, fully optimized at level 3, with no problem.

Here is how I compiled:

ifort /nologo /O2 /c /Fooptbug.obj optbug.f
ifort /nologo /O2 /c /Fomysub.obj mysub.f
link /nologo /MANIFEST /OUT:optbug.exe optbug.obj mysub.obj

Etienne

mecej4 · ‎09-28-2011

I do not have 12.0.4 installed anymore. Both 12.0.5 and 12.1, given your mysub.f:

[fortran]      SUBROUTINE MYSUB(TABLE,TESTME)
C
      IMPLICIT NONE
      INTEGER TESTME(*)
      REAL*8 TABLE(2,*)
      REAL*8 SETME(2)
      INTEGER L
C
      PRINT *, TESTME(1)
      DO L=1,2
         IF(TESTME(1) .EQ. 1 .OR. TESTME(1) .EQ. 2) THEN
C           Since TESTME(1) is 0, we will never get in here...
            SETME(L) = TABLE(L,TESTME(2))
         ENDIF
      ENDDO
C
      RETURN
      END[/fortran]

will not produce any code for the DO loop if any optimization level other than 0 is enabled. This is not because of the value of TESTME(1), as your comment claims, since that value is not known at compile time. It is because the sole effect of the loop is to assign values to a local variable that is not used elsewhere in the subroutine. Indeed, the assembly output is

[bash]        PUBLIC MYSUB
MYSUB   PROC
; parameter 1: rcx
; parameter 2: rdx
        sub       rsp, 104
        mov       r8, 0208384ff00H
        mov       eax, DWORD PTR [rdx]
        lea       rcx, QWORD PTR [48+rsp]
        mov       edx, -1
        lea       r9, QWORD PTR [_2_STRLITPACK_0.0.1]
        mov       QWORD PTR [48+rsp], 0
        lea       r10, QWORD PTR [96+rsp]
        mov       DWORD PTR [96+rsp], eax
        mov       QWORD PTR [32+rsp], r10
        call      for_write_seq_lis
        add       rsp, 104
        ret
[/bash]

Note that the fourth instruction overwrites the address of TABLE. You can set TESTME(1) = 77 in the main program, and there will be no effect on the code produced for the subroutine (assuming IPO is not being used).

Furthermore, if you comment out the PRINT statement in the subroutine, the assembly output is trivial:

[bash]MYSUB   PROC
; parameter 1: rcx
; parameter 2: rdx
        ret        
[/bash]

Again, note that this collapsing of the code is because of optimization of the subroutine alone, since the compiler has not even seen the source code of the main program, where you set TESTME(1) = 1.
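
To keep the optimizer from discarding the loop, its result has to be observable outside the subroutine. A minimal sketch, assuming it is acceptable to change the interface: MYSUB2 is a hypothetical variant of MYSUB that returns SETME through a dummy argument. Whether this variant reproduces the Update 4 access violation has not been checked here.

```fortran
      SUBROUTINE MYSUB2(TABLE, TESTME, SETME)
!     Hypothetical variant of MYSUB: SETME is a dummy argument,
!     so the caller can observe it and the loop is no longer
!     dead code.
      IMPLICIT NONE
      INTEGER TESTME(*)
      REAL*8 TABLE(2,*)
      REAL*8 SETME(2)
      INTEGER L
      DO L = 1, 2
         IF (TESTME(1) .EQ. 1 .OR. TESTME(1) .EQ. 2) THEN
!           TABLE is read only when the test above is true.
            SETME(L) = TABLE(L,TESTME(2))
         ENDIF
      ENDDO
      RETURN
      END
```

Called with TESTME(1) = 0, it should leave SETME unchanged and never read TABLE(L,TESTME(2)), whatever TESTME(2) contains.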

emonette123 · ‎09-29-2011

Update 6 fixed the problem...

I will recompile and retest my code with /O2.

Etienne

P.S.

Done with recompilation and testing; I no longer experience crashes. Execution is faster now (we used to optimize with /O1). So far, so good.

Thanks for your help, and I hope Update 6 will be the good one...