topic x64 object runs much slower than win32 object when do loop has in Intel® Fortran Compiler

x64 object runs much slower than win32 object when do loop has increment parameter

yamajun2 — Fri, 26 Feb 2010 18:32:02 GMT

Hi,

I encountered some strange behavior.

When a program has both

1. a DO loop whose increment parameter is a variable, and

2. a reference to the DO loop index after the loop,

the object compiled in x64 RELEASE mode runs much slower than the one compiled in win32 RELEASE mode.

Here is a minimal sample program.

[fortran]PROGRAM x64_Release_runs_slow
  IMPLICIT NONE

  INTEGER :: i, j, k
  REAL :: t0, t1

  CALL CPU_TIME(t0)
!
  k = 1
  DO i = 1, 10**8
   DO j = 1, 10**2, k  ! 1. use variable k as an increment parameter  
    !                  
   END DO
  END DO
  PRINT *, j           ! 2. reference to the loop index j
!
  CALL CPU_TIME(t1)
  PRINT *, t1 - t0 

  STOP
END PROGRAM x64_Release_runs_slow[/fortran]

This sample takes a few seconds in x64 RELEASE mode, but practically 0 seconds in win32 RELEASE mode.

The anomaly disappears when k is replaced with the constant 1 or when the line "PRINT *, j" is commented out.

In DEBUG mode, x64 and win32 run in almost the same CPU time.

I suppose this might be an optimization problem.

I attach the more realistic program in which I encountered this problem. (The option /assume:realloc_lhs is required.) In this case the x64 version is ~40% slower than win32.

Yamajun

yamajun2 — Sun, 28 Feb 2010 05:38:20 GMT

Nobody interested?

Here is a simpler example. There are 4 semantically identical DO LOOPs. Only the second one runs slow.

Result: x64 RELEASE

[plain]  2000000001  0.0000000E+00
  2000000001  0.3588023
  2000000001  0.0000000E+00
            2000000001  0.0000000E+00[/plain]

[fortran]PROGRAM test
  IMPLICIT NONE
  INTEGER :: i, j, k
  INTEGER(8) :: jj
  REAL :: t0, t1
  
  CALL CPU_TIME(t0) 
  DO j = 1, 2 * 10**9
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
!
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, 1
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO jj = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, jj, t1 - t0
!
!
  STOP
END PROGRAM test[/fortran]
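
For reference, the Fortran standard fixes the iteration count of `DO j = m1, m2, m3` when the loop is entered, as MAX(INT((m2 - m1 + m3) / m3), 0), and the index is left one increment past the last value used in the body. That is why all four loops print 2000000001 however the increment is written. A minimal sketch with small, hypothetical bounds (1, 10 and k = 3 are illustrative values, not taken from the posts above):

```fortran
PROGRAM trip_count_sketch
  IMPLICIT NONE
  INTEGER :: j, k, n_iter, expected

  k = 3                      ! hypothetical increment, chosen for illustration
  n_iter = 0
  DO j = 1, 10, k            ! trip count is fixed here, at loop entry
    n_iter = n_iter + 1
  END DO

  ! Iteration count as defined by the standard: MAX((m2 - m1 + m3) / m3, 0)
  expected = MAX((10 - 1 + k) / k, 0)

  PRINT *, 'iterations counted :', n_iter
  PRINT *, 'iterations expected:', expected
  PRINT *, 'j after the loop   :', j
END PROGRAM trip_count_sketch
```

With these bounds the body runs 4 times (j = 1, 4, 7, 10) and j is 13 afterwards. Because the count depends only on the values at loop entry, a compiler that can prove k = 1 is free to treat the loop as a unit-stride loop.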

Steven_L_Intel1 — Sun, 28 Feb 2010 23:46:42 GMT

None of these loops do anything. The compiler probably optimizes some away and not others - it is not a useful test program.

Try again with a more realistic program - remembering that the optimizer is smarter than you might think.

rasa — Mon, 01 Mar 2010 14:03:51 GMT

Just want to add my experience. Have you enabled loop unrolling or changed the threshold for auto-parallelization? Loop unrolling might slow down some specific segments of code when used with O3. In other words, the results of aggressive optimization are sometimes code specific.

yamajun2 — Mon, 01 Mar 2010 18:00:00 GMT

Steve,

Yes, these loops do nothing and I know the optimizer is quite clever.

I ran into this phenomenon in the more realistic program that I attached to my first post.

But anyway, I found a reason: it was the vectorizer. (Thanks ragu, you gave me a hint.) In x64 RELEASE, a DO loop with a variable increment parameter is not vectorized.

There used to be info messages when loops were vectorized. Seeing none, I had assumed there was neither vectorization nor parallelization.

Here is another simple example.

[fortran]PROGRAM x64_Release_runs_slow
  IMPLICIT NONE

  INTEGER :: i, j, k
  REAL :: s, t0, t1

  k = 1
  CALL CPU_TIME(t0)
  DO i = 1, 10**5
   s = 0.0
   DO j = 0, 10**4, k
     s = s + REAL(j)
   END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s 


  CALL CPU_TIME(t0)
  DO i = 1, 10**5
   s = 0.0
   DO j = 0, 10**4
     s = s + REAL(j)
   END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s 
 

  STOP
END PROGRAM x64_Release_runs_slow[/fortran]

win32 compiler vectorizer messages

[plain]1>C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(11): (col. 4) remark: LOOP WAS VECTORIZED.
1>C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.
[/plain]

win32 output

[plain] time =  0.2340015     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07
 time =  0.1716011     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07[/plain]

x64 compiler vectorizer messages

[plain]1>C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(11): (col. 4) remark: loop was not vectorized: existence of vector dependence.
1>C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.
[/plain]

x64 output

[plain] time =   1.201208     n(n+1)/2=  5.0005000E+07 calc.  5.0002896E+07
 time =  0.1872011     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07[/plain]

In the x64 build, the first loop is slow and its sum does not match the expected value.
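
A likely explanation for the differing sum is summation order in single precision rather than a miscompiled loop. A default REAL (assumed here to be 32-bit IEEE single precision) cannot represent every integer above 2**24 = 16777216, so a single running total starts rounding once it passes that value. A vectorized loop typically keeps several partial sums, each of which stays smaller for longer. The sketch below imitates that with four partial sums; the lane count of 4 and the program name are assumptions for illustration, not something reported by the compiler in this thread.

```fortran
PROGRAM sum_order_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 10**4
  INTEGER :: j, lane
  REAL :: s_seq, part(4)
  DOUBLE PRECISION :: s_dp

  ! One running total, as in a scalar (non-vectorized) loop
  s_seq = 0.0
  DO j = 0, n
    s_seq = s_seq + REAL(j)
  END DO

  ! Four interleaved partial sums, imitating a 4-lane vectorized reduction.
  ! Each partial sum stays below 2**24, so it is exact until the final SUM.
  part = 0.0
  DO j = 0, n
    lane = MOD(j, 4) + 1
    part(lane) = part(lane) + REAL(j)
  END DO

  ! Double precision reference; the exact sum is 50005000
  s_dp = 0.0D0
  DO j = 0, n
    s_dp = s_dp + DBLE(j)
  END DO

  PRINT *, 'single running total:', s_seq
  PRINT *, 'four partial sums   :', SUM(part)
  PRINT *, 'double precision    :', s_dp
END PROGRAM sum_order_sketch
```

The printed values depend on the compiler and its options: an optimizer may vectorize the first loop here as well, in which case both single-precision results can agree. Accumulating in DOUBLE PRECISION avoids the issue for sums of this size.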

jimdempseyatthecove — Mon, 01 Mar 2010 21:55:39 GMT

I agree, it looks like x64 compilation does not recognize

k=1
DO i=...
DO j=s,e,k

as having k as "fixed at 1"

It looks like an area where the optimizer missed an opportunity.

Jim
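
One possible source-level workaround, sketched under the assumption that the compiler vectorizes the loop once the increment is visibly constant (as the vectorizer messages for the loop without an increment suggest), is to give the common k = 1 case its own unit-stride loop. This has not been verified against the compiler version discussed in this thread, and the program name is hypothetical.

```fortran
PROGRAM stride_specialization_sketch
  IMPLICIT NONE
  INTEGER :: j, k
  REAL :: s

  k = 1
  s = 0.0

  IF (k == 1) THEN
    ! Unit-stride version: the increment is a compile-time constant here
    DO j = 0, 10**4
      s = s + REAL(j)
    END DO
  ELSE
    ! General version, kept for any other increment
    DO j = 0, 10**4, k
      s = s + REAL(j)
    END DO
  END IF

  PRINT *, 'sum =', s
END PROGRAM stride_specialization_sketch
```

If the increment never changes at run time, declaring it as a named constant (`INTEGER, PARAMETER :: k = 1`) achieves the same thing without duplicating the loop body.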