x64 object runs much slower than win32 object when do loop has increment parameter

yamajun2 · ‎02-26-2010

Hi,

I encountered a strange behavior.

When a program has both

1. a DO loop whose increment parameter is a variable, and

2. a reference to the DO loop index after the loop,

the object compiled in x64 RELEASE mode runs much slower than that in win32 RELEASE mode.

Here is a minimal sample program.

[fortran]PROGRAM x64_Release_runs_slow
  IMPLICIT NONE

  INTEGER :: i, j, k
  REAL :: t0, t1

  CALL CPU_TIME(t0)
!
  k = 1
  DO i = 1, 10**8
   DO j = 1, 10**2, k  ! 1. use variable k as an increment parameter  
    !                  
   END DO
  END DO
  PRINT *, j           ! 2. reference to the loop index j
!
  CALL CPU_TIME(t1)
  PRINT *, t1 - t0 

  STOP
END PROGRAM x64_Release_runs_slow[/fortran]

This sample takes a few seconds in x64 RELEASE mode, but practically 0 seconds in win32 RELEASE mode.

This anomaly disappears when k is changed to the constant 1 or when the line "PRINT *, j" is commented out.

In DEBUG mode x64 and win32 run with almost the same cpu_time.

I suppose this might be an optimization problem.
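
For context, the Fortran standard defines the iteration count of `DO j = m1, m2, m3` as MAX(INT((m2 - m1 + m3) / m3), 0), and after the loop the index holds m1 plus that count times m3. When m3 is a variable, this generally has to be evaluated at run time, and a later reference to j means its final value cannot simply be discarded. A small sketch of that rule follows; the names m1, m2, m3 and ntrips are illustrative and not taken from the sample program.

[fortran]PROGRAM trip_count_sketch
  IMPLICIT NONE
  INTEGER :: j, m1, m2, m3, ntrips

  m1 = 1
  m2 = 10**2
  m3 = 1          ! variable increment, playing the role of k in the sample

  ! Iteration count as defined by the Fortran standard
  ! (integer division truncates, matching INT(...))
  ntrips = MAX((m2 - m1 + m3) / m3, 0)

  DO j = m1, m2, m3
    ! empty body, as in the sample
  END DO

  ! After the loop, j should equal m1 + ntrips * m3
  PRINT *, 'trip count     =', ntrips
  PRINT *, 'j after loop   =', j
  PRINT *, 'm1 + ntrips*m3 =', m1 + ntrips * m3
END PROGRAM trip_count_sketch[/fortran]

The last two printed values should agree for any nonzero m3.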

I attach the more realistic program in which I encountered this problem. (The option /assume:realloc_lhs is required.) In this case the x64 version is ~40% slower than win32.

Yamajun

yamajun2 · ‎02-27-2010

Nobody interested?

Here is a simpler example. There are 4 semantically identical DO LOOPs. Only the second one runs slow.

Result: x64 RELEASE

[plain]  2000000001  0.0000000E+00
  2000000001  0.3588023
  2000000001  0.0000000E+00
            2000000001  0.0000000E+00[/plain]

[fortran]PROGRAM test
  IMPLICIT NONE
  INTEGER :: i, j, k
  INTEGER(8) :: jj
  REAL :: t0, t1
  
  CALL CPU_TIME(t0) 
  DO j = 1, 2 * 10**9
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
!
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, 1
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO jj = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, jj, t1 - t0
!
!
  STOP
END PROGRAM test[/fortran]

Steven_L_Intel1 · ‎02-28-2010

None of these loops do anything. The compiler probably optimizes some away and not others - it is not a useful test program.

Try again with a more realistic program - remembering that the optimizer is smarter than you might think.

rasa · ‎03-01-2010

Just want to add my experience. Have you enabled loop unrolling or changed the threshold for auto-parallelization? Loop unrolling might slow down some specific segments of code when used with O3. In other words, the results of aggressive optimization are sometimes code-specific.

yamajun2 · ‎03-01-2010

Steve,

Yes, these loops do nothing and I know the optimizer is quite clever.

I ran into this phenomenon in a more realistic program, which I attached to my first post.

But anyway I found the reason. It was the vectorizer. (Thanks ragu, you gave me a hint.) In x64 RELEASE, a DO loop with a variable increment parameter is not vectorized.

There used to be info messages when loops were vectorized, so I thought there was no vectorization or parallelization.

Here is another simple example.

[fortran]PROGRAM x64_Release_runs_slow
  IMPLICIT NONE

  INTEGER :: i, j, k
  REAL :: s, t0, t1

  k = 1
  CALL CPU_TIME(t0)
  DO i = 1, 10**5
   s = 0.0
   DO j = 0, 10**4, k
     s = s + REAL(j)
   END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s 


  CALL CPU_TIME(t0)
  DO i = 1, 10**5
   s = 0.0
   DO j = 0, 10**4
     s = s + REAL(j)
   END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s 
 

  STOP
END PROGRAM x64_Release_runs_slow[/fortran]

Win32 compiler vectorizer message

[plain]1>C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(11): (col. 4) remark: LOOP WAS VECTORIZED.
1>C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.
[/plain]

win32 output

[plain] time =  0.2340015     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07
 time =  0.1716011     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07[/plain]

x64 compiler vectorizer message

[plain]1>C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(11): (col. 4) remark: loop was not vectorized: existence of vector dependence.
1>C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.
[/plain]

x64 output

[plain] time =   1.201208     n(n+1)/2=  5.0005000E+07 calc.  5.0002896E+07
 time =  0.1872011     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07[/plain]

The first loop is slow and the sum is not correct.
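
One plausible explanation for the differing sum (an assumption, not something confirmed in this thread): in single precision, a strictly sequential accumulation loses low-order bits once the running total exceeds 2**24, whereas a vectorized loop keeps several partial sums that each stay small enough to be exact. The sketch below imitates this with four partial sums. The lane count of 4 and the variable names are illustrative, and an optimizing compiler may itself reorder the first loop, so the difference is most likely to show with optimization disabled.

[fortran]PROGRAM summation_order_sketch
  IMPLICIT NONE
  INTEGER :: j, lane
  REAL :: s_seq, s_lane(4)   ! assumption: 4 lanes, chosen for illustration

  ! Strictly sequential single-precision accumulation
  s_seq = 0.0
  DO j = 0, 10**4
    s_seq = s_seq + REAL(j)
  END DO

  ! Four independent partial sums, combined at the end
  ! (roughly what a 4-wide vectorized reduction does)
  s_lane = 0.0
  DO j = 0, 10**4
    lane = MOD(j, 4) + 1
    s_lane(lane) = s_lane(lane) + REAL(j)
  END DO

  PRINT *, 'sequential   =', s_seq
  PRINT *, 'partial sums =', SUM(s_lane)
  PRINT *, 'exact (int)  =', 10**4 * (10**4 + 1) / 2
END PROGRAM summation_order_sketch[/fortran]

If this explanation holds, the "incorrect" value is ordinary single-precision rounding in the non-vectorized loop rather than a miscompilation.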

jimdempseyatthecove · ‎03-01-2010

I agree, it looks like x64 compilation does not recognize

k=1
DO i=...
DO j=s,e,k

as having k as "fixed at 1"

It looks like an area where the optimizer missed an opportunity.

Jim
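
A possible workaround, sketched under the assumption that the increment is usually 1 at run time: test the increment once and dispatch to a loop written with an implicit unit stride, which is the form the x64 vectorizer handled in the example above. The module and function names (stride_dispatch, sum_with_stride) are hypothetical, and whether this recovers the win32 timings on a given compiler version would need to be measured.

[fortran]MODULE stride_dispatch
  IMPLICIT NONE
CONTAINS
  ! Hypothetical helper: sums REAL(j) for j = lo, hi, k (k must be nonzero)
  FUNCTION sum_with_stride(lo, hi, k) RESULT(s)
    INTEGER, INTENT(IN) :: lo, hi, k
    REAL :: s
    INTEGER :: j

    s = 0.0
    IF (k == 1) THEN
      ! Unit stride, with no increment variable in the DO statement
      DO j = lo, hi
        s = s + REAL(j)
      END DO
    ELSE
      ! General case with a run-time increment
      DO j = lo, hi, k
        s = s + REAL(j)
      END DO
    END IF
  END FUNCTION sum_with_stride
END MODULE stride_dispatch

PROGRAM stride_dispatch_demo
  USE stride_dispatch
  IMPLICIT NONE
  INTEGER :: k

  k = 1
  PRINT *, 'calc. =', sum_with_stride(0, 10**4, k)
END PROGRAM stride_dispatch_demo[/fortran]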