Intel® Fortran Compiler

ipo optimization exposes bug

Werner__Greg

I have a particle-in-cell simulation code for which ifort optimization (in both v. 17 and v. 18) exposes a bug, and I'm not sure where to turn for help.  I have simplified the code as much as possible, so there are just a few lines of meaningful code.  The bug depends on optimization level, so it's frustrating but not surprising that its appearance depends sensitively on a number of seemingly irrelevant code statements, which presumably affect how the code is transformed during optimization.

My knowledge of Fortran is admittedly not very thorough or systematic; is there something subtle (or obvious!) that I'm doing wrong that encourages the compiler to optimize in a way that contradicts my intent?

The simplified code performs only addition/subtraction starting with double precision values of 0 or 1, so I would think that would tend to rule out finite precision problems.  More important, the code functions properly with low optimization and with runtime bounds checking -- so I think this should rule out really obvious bugs (like accessing invalid array values).

The bug occurs with "-O3 -ipo" or "-O3 -ipo-separate" but not with "-O3" or "-O2 -ipo".  Using "-fltconsistency" fixes the bug.  The following compilation options seem to have no effect (on whether the bug occurs): -fno-inline -no-vec -no-simd -no-scalar-rep -qno-opt-assume-safe-padding -falias -ffnalias -fprotect-parens -ip-no-inlining -ansi-alias -unroll=0.

One of the oddest things is that the bug occurs only if I link in 3 empty modules (with just 2 empty modules, it runs fine).

The code has main.F90, which simply calls the only function in mod_initial.f90.  I've attached the entire code and Makefile, but here's the function (I've tried to simplify further, but every additional simplification I make--removing irrelevant statements, removing the double loop, reducing the loop iterations--fixes the bug):

SUBROUTINE INIT_DRIFT_MAXWELLIAN()

IMPLICIT NONE

! Input parameter
INTEGER, PARAMETER                   :: ND = 3
DOUBLE PRECISION, DIMENSION(1:ND)    :: resRa, unifRa,irrelRa1,irrelRa2
INTEGER                              :: i,j

!unifRa=-135220.172189807d0
unifRa=1.d0

resRa=0.0d0

DO i=1,ND

  irrelRa2=0.d0 ! moving this outside loop fixes bug
  irrelRa1=0.d0  ! moving this outside loop fixes bug
 
  DO j=1,ND-1
    ! Following should be equivalent to resRa(i)=resRa(i)
    ! (up to possible finite-precision problems).
    resRa(i)=resRa(i)+unifRa(j+1)-unifRa(j)
  ENDDO

ENDDO

! Have to use results somehow or compiler will just optimize everything away.
IF (0==0 .AND. resRa(1) /= 0.d0) THEN
    PRINT *,'Halting.'
    PRINT *, "resRa ="
    PRINT *, resRa
    PRINT *, "unifRa="
    PRINT *, unifRa
    PRINT *, "ND=", ND
    PRINT *, "unifRa(1)=", unifRa(1)
    PRINT *, "irrelRa2(1)=", irrelRa2(1) ! removing this fixes bug
    PRINT *, "irrelRa1(1)=", irrelRa1(1) ! removing this fixes bug
    PRINT *, "resRa(1)=", resRa(1)
  ERROR STOP 125
ENDIF


END SUBROUTINE INIT_DRIFT_MAXWELLIAN

The array "resRa" should remain entirely zero (and nothing should be printed out): however, when the bug occurs, it yields the following output (where resRa = -1):

 Halting.
 resRa =
  -1.00000000000000       -1.00000000000000       -1.00000000000000     
 unifRa=
   1.00000000000000        1.00000000000000        1.00000000000000     
 ND=           3
 unifRa(1)=   1.00000000000000     
 irrelRa2(1)=  0.000000000000000E+000
 irrelRa1(1)=  0.000000000000000E+000
 resRa(1)=  -1.00000000000000     
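
As a check on the expected result, here is the arithmetic written out. The notation u_j for unifRa(j) is introduced here for illustration and is not part of the original code. For each i, the inner loop adds a telescoping sum to resRa(i):

```latex
\sum_{j=1}^{ND-1} \left( u_{j+1} - u_{j} \right) = u_{ND} - u_{1} = 1 - 1 = 0
```

Every partial result here is a small integer, which double precision represents exactly, so any order of evaluation should give exactly 0. A value of -1 therefore cannot be explained by rounding.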

This was compiled with ifort 18.0.2 on Stampede2 (at TACC) using (for example) options

-g -O3 -ipo-separate -fno-inline -no-vec -no-simd -no-scalar-rep -qno-opt-assume-safe-padding -falias -ffnalias -fprotect-parens -ip-no-inlining -ansi-alias -unroll=0

and was run on a KNL node.  The same bug occurs on Skylake nodes of Stampede2.

 

Thanks for any help,

Greg.

Juergen_R_R

I cannot reproduce the problem on different flavors of Linux (RedHat 6, RedHat 7, Ubuntu 16). The binary does not print out anything.

Werner__Greg

Thanks for trying.

I've reproduced the problem on a system with Intel Xeon E5-2680, RHEL 7.4, ifort 17.0.4, using the same compile options as in the previous attachment, as well as this shorter list of options:

 

ifort -g -O3 -ipo-separate -c mod_consts.F90
ifort -g -O3 -ipo-separate  -o mod_in_recon.o -c mod_in_recon.f90
ifort -g -O3 -ipo-separate  -c mod_enum.f90
ifort -g -O3 -ipo-separate  -c mod_initial.f90
ifort -g -O3 -ipo-separate -c main.F90
ifort -g -O3 -ipo-separate  -o a.out mod_consts.o mod_in_recon.o mod_enum.o mod_initial.o main.o
~/debugIpo$ ./a.out
 Halting.
 resRa =
  -1.00000000000000       -1.00000000000000       -1.00000000000000     
 unifRa=
   1.00000000000000        1.00000000000000        1.00000000000000     
 ND=           3
 unifRa(1)=   1.00000000000000     
 irrelRa2(1)=  0.000000000000000E+000
 irrelRa1(1)=  0.000000000000000E+000
 resRa(1)=  -1.00000000000000     
125

Werner__Greg

The staff at NASA's Pleiades have reproduced this bug (on SLES 12) using ifort versions 15, 16, and 18.0.0.128 and 18.0.3.222.  However, they also tried using ifort 19.0.3.199 (which I think isn't yet officially supported on Pleiades), and the bug does not appear for ifort 19.  [ifort 19 is not available on most systems I use.]  The question is now: does the program work with ifort 19 because there was a compiler bug up through v. 18 that was fixed in v. 19, or did v. 19 introduce some serendipitous change in optimization that prevents this bug from being exposed (i.e., some harmless code changes on my part could result in the re-appearance of the bug)? 

For example, in the full simulation code, the bug does not appear with ifort 17 (which is why we noticed it only when Stampede2 upgraded to ifort 18); however, in the simplified code, the bug does appear in ifort 17.

[Pleiades staff also suggested I compile with all warnings enabled to see if there could be some subtle issue in my code: except for complaining about my "ERROR STOP 125" command, which I've now replaced with "STOP 125," ifort and gfortran issue no warnings.]

Juergen_R_R

Indeed, I tested only with ifort 19.0.3.199. With both ifort 18.0.5.274 and ifort 17.0.8.262 I can reproduce the behavior. I also switched on all checks with the nagfor compiler: it gives no output, no errors, and no warnings, which is the expected behavior.

jimdempseyatthecove

For a work-around, try using

!DIR$ NOVECTOR

on the outer loop first; if that doesn't work, then on the inner loop as well.

There appears to be no option to disable loop collapse for non-OpenMP loops (could be undocumented).

Your sample code showed a small value for ND. If the actual code uses a relatively small ND, lack of vectorization might not make a difference.

I do not have a system that reproduces the problem here.

Jim Dempsey

Werner__Greg

Thanks!  Placing

!DIR$ NOVECTOR

before the inner loop fixes the problem for ifort 18.0.3.222 (whether there is also a NOVECTOR directive before the outer loop has no effect on the bug).
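
For concreteness, the directive placement described above can be sketched as a small standalone program. This is a minimal sketch, not the attached reproducer: the program name novector_sketch and the ANY() check are assumptions made here, while the loop body and array names follow the posted subroutine. Compilers other than ifort treat the !DIR$ line as an ordinary comment.

```fortran
program novector_sketch
  implicit none
  integer, parameter :: nd = 3
  double precision, dimension(nd) :: resRa, unifRa, irrelRa1, irrelRa2
  integer :: i, j

  unifRa = 1.d0
  resRa  = 0.d0

  do i = 1, nd
    irrelRa2 = 0.d0
    irrelRa1 = 0.d0

    ! The directive applies to the DO loop that immediately follows it,
    ! i.e. the inner loop, which is the placement reported to work.
    !DIR$ NOVECTOR
    do j = 1, nd - 1
      resRa(i) = resRa(i) + unifRa(j+1) - unifRa(j)
    end do
  end do

  ! resRa should still be exactly zero; report and stop otherwise.
  if (any(resRa /= 0.d0)) then
    print *, 'resRa =', resRa
    print *, 'irrelRa1(1) =', irrelRa1(1), ' irrelRa2(1) =', irrelRa2(1)
    stop 125
  end if
  print *, 'ok'
end program novector_sketch
```

Since the bug is reported to depend on the surrounding modules and on -O3 -ipo, this single-file version may not reproduce the failure by itself; it only shows where the directive goes.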

[In the full code, ND tends to be around 800.]  Of course, the full code contains multiple loops pretty similar to this (though for whatever reason, just the one appears troublesome -- as far as I know).

Why did you suspect this directive would fix the problem?  And do you know why the compiler options "-no-vec -no-simd" didn't have the same effect?

I'm not sure if this adds useful information, but with -qopt-report=5, ifort 18 and 19 yield identical results except for the inner loop (though all say "loop was not vectorized"):

(ifort 18 - bug)

   LOOP BEGIN at mod_initial.f90(29,3)
      remark #25045: Fused Loops: ( 29 30 32 )

      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed OUTPUT dependence between irrelra1(:) (30:3) and irrelra1(:) (30:3)
      remark #15346: vector dependence: assumed OUTPUT dependence between irrelra1(:) (30:3) and irrelra1(:) (30:3)
   LOOP END

(ifort 19 - no bug: note change from irrelra1 to irrelra2 in remark #15346)

   LOOP BEGIN at mod_initial.f90(29,3)
      remark #25045: Fused Loops: ( 29 30 32 )

      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed OUTPUT dependence between irrelra2(:) (29:3) and irrelra2(:) (29:3)
      remark #15346: vector dependence: assumed OUTPUT dependence between irrelra2(:) (29:3) and irrelra2(:) (29:3)
   LOOP END

(ifort 18 with !DIR$ NOVECTOR for inner loop - no bug)

   LOOP BEGIN at mod_initial.f90(30,3)
      remark #25045: Fused Loops: ( 30 32 )

      remark #15319: loop was not vectorized: novector directive used
   LOOP END

   LOOP BEGIN at mod_initial.f90(32,3)
      remark #25046: Loop lost in Fusion
   LOOP END

jimdempseyatthecove

>>Why did you suspect this directive would fix the problem?

This was more of a hunch than anything else. Over 50 years of programming gives me plenty of hunches. The structure was a nested loop (DO I...DO J...) that the compiler could potentially collapse into a single loop. In earlier versions of the compiler, this has proven to be problematic with regard to vectorization, especially when the loop index is augmented with a + or - offset.

 Note the opt-report states "Fused Loops" which is (grammatically) incorrect. Either this is a mis-statement, or the compiler is in the wrong section of code.

>> do you know why the compiler options "-no-vec -no-simd" didn't have the same effect?

I do not think it specifically has to do with vectorization, but rather a case of the loop collapsing. The compiler does not have a directive to instruct NOCOLLAPSE and the nearest thing to accomplish this was to insert the NOVECTOR.

BTW in cases like this, especially where the inner loop count is ~800, a different workaround strategy is to move the code into a callable subroutine and, if necessary, not apply IPO to it.
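
A minimal sketch of that strategy follows, assuming the inner loop is the part moved out. The module name diff_sum_mod and the subroutine name add_diffs are hypothetical names chosen for illustration. In practice the module would go in its own source file so that file can be compiled without -ipo; it is shown in a single file here only to keep the sketch self-contained.

```fortran
! Hypothetical helper module: in a real build this would be a separate
! source file, compiled without -ipo if necessary.
module diff_sum_mod
  implicit none
contains
  subroutine add_diffs(u, acc)
    double precision, intent(in)    :: u(:)
    double precision, intent(inout) :: acc
    integer :: j
    ! Same arithmetic as the original inner loop.
    do j = 1, size(u) - 1
      acc = acc + u(j+1) - u(j)
    end do
  end subroutine add_diffs
end module diff_sum_mod

program callable_sketch
  use diff_sum_mod, only: add_diffs
  implicit none
  integer, parameter :: nd = 3
  double precision, dimension(nd) :: resRa, unifRa
  integer :: i

  unifRa = 1.d0
  resRa  = 0.d0

  ! The outer loop stays in the caller; the inner loop is now a call.
  do i = 1, nd
    call add_diffs(unifRa, resRa(i))
  end do

  if (any(resRa /= 0.d0)) then
    print *, 'resRa =', resRa
    stop 125
  end if
  print *, 'ok'
end program callable_sketch
```

Whether this avoids the problem in the full code is untested here; the idea is only that a call boundary the optimizer cannot see through keeps the two loops from being transformed together.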

Jim Dempsey

 
