<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic x64 object runs much slower than win32 object when do loop has  in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883631#M76052</link>
    <description>&lt;P&gt;Nobody interested?&lt;/P&gt;
&lt;P&gt;Here is a simpler example. There are 4 semantically identical DO LOOPs. Only the second one runs slow.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Result: x64 RELEASE&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;[plain]  2000000001  0.0000000E+00
  2000000001  0.3588023
  2000000001  0.0000000E+00
            2000000001  0.0000000E+00[/plain]&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;[fortran]PROGRAM test
  IMPLICIT NONE
  INTEGER :: i, j, k
  INTEGER(8) :: jj
  REAL :: t0, t1
  
  CALL CPU_TIME(t0) 
  DO j = 1, 2 * 10**9
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
!
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, 1
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO jj = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, jj, t1 - t0
!
!
  STOP
END PROGRAM test[/fortran]&lt;/PRE&gt;
&lt;BR /&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Sun, 28 Feb 2010 05:38:20 GMT</pubDate>
    <dc:creator>yamajun2</dc:creator>
    <dc:date>2010-02-28T05:38:20Z</dc:date>
    <item>
      <title>x64 object runs much slower than win32 object when do loop has increment parameter</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883630#M76051</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I encountered some strange behavior.&lt;/P&gt;
&lt;P&gt;When both of the following hold:&lt;/P&gt;
&lt;P&gt;1. a DO loop uses a variable as its increment parameter,&lt;/P&gt;
&lt;P&gt;2. the DO loop index is referenced after the loop,&lt;/P&gt;
&lt;P&gt;the object compiled in x64 RELEASE mode runs much slower than the one compiled in win32 RELEASE mode.&lt;/P&gt;
&lt;P&gt;Here is a minimal sample program.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;[fortran]PROGRAM x64_Release_runs_slow
  IMPLICIT NONE

  INTEGER :: i, j, k
  REAL :: t0, t1

  CALL CPU_TIME(t0)
!
  k = 1
  DO i = 1, 10**8
   DO j = 1, 10**2, k  ! 1. use variable k as an increment parameter  
    !                  
   END DO
  END DO
  PRINT *, j           ! 2. reference to the loop index j
!
  CALL CPU_TIME(t1)
  PRINT *, t1 - t0 

  STOP
END PROGRAM x64_Release_runs_slow[/fortran]&lt;/PRE&gt;
&lt;BR /&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;This sample takes a few seconds in x64 RELEASE mode, but practically 0 seconds in win32 RELEASE mode.&lt;/P&gt;
&lt;P&gt;The anomaly disappears when k is replaced by the constant 1 or when the line "PRINT *, j" is commented out.&lt;/P&gt;
&lt;P&gt;In DEBUG mode, x64 and win32 run with almost the same cpu_time.&lt;/P&gt;
&lt;P&gt;I suppose this might be an optimization problem.&lt;/P&gt;
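&lt;P&gt;A minimal sketch of the workaround mentioned above, assuming it is acceptable to fix the increment at compile time: an increment declared as a named constant (PARAMETER) is equivalent to writing the literal 1. The names stride_demo, final_index and step are hypothetical, and this variant has not been timed in this thread.&lt;/P&gt;
&lt;PRE&gt;[fortran]MODULE stride_demo
  IMPLICIT NONE
  ! Hypothetical name: the increment as a named constant, known at compile time
  INTEGER, PARAMETER :: step = 1
CONTAINS
  ! Runs an empty DO loop with the constant increment and returns
  ! the value of the loop index after the loop has finished
  FUNCTION final_index(n) RESULT(j)
    INTEGER, INTENT(IN) :: n
    INTEGER :: j
    INTEGER :: idx
    DO idx = 1, n, step
      ! empty body, as in the sample above
    END DO
    j = idx
  END FUNCTION final_index
END MODULE stride_demo

PROGRAM constant_increment
  USE stride_demo
  IMPLICIT NONE
  INTEGER :: i, j
  REAL :: t0, t1

  CALL CPU_TIME(t0)
  DO i = 1, 10**6
    j = final_index(10**2)
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0   ! loop index after the loop, elapsed CPU time
END PROGRAM constant_increment[/fortran]&lt;/PRE&gt;
&lt;P&gt;If the increment really has to vary at run time, this does not apply; it only covers the case where k is always 1.&lt;/P&gt;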
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I have attached the more realistic program in which I encountered this problem. (The option /assume:realloc_lhs is required.) In this case the x64 version is ~40% slower than the win32 version.&lt;/P&gt;
&lt;P&gt;Yamajun&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 26 Feb 2010 18:32:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883630#M76051</guid>
      <dc:creator>yamajun2</dc:creator>
      <dc:date>2010-02-26T18:32:02Z</dc:date>
    </item>
    <item>
      <title>x64 object runs much slower than win32 object when do loop has</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883631#M76052</link>
      <description>&lt;P&gt;Nobody interested?&lt;/P&gt;
&lt;P&gt;Here is a simpler example. There are 4 semantically identical DO LOOPs. Only the second one runs slow.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Result: x64 RELEASE&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;[plain]  2000000001  0.0000000E+00
  2000000001  0.3588023
  2000000001  0.0000000E+00
            2000000001  0.0000000E+00[/plain]&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;[fortran]PROGRAM test
  IMPLICIT NONE
  INTEGER :: i, j, k
  INTEGER(8) :: jj
  REAL :: t0, t1
  
  CALL CPU_TIME(t0) 
  DO j = 1, 2 * 10**9
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
!
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, 1
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO jj = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, jj, t1 - t0
!
!
  STOP
END PROGRAM test[/fortran]&lt;/PRE&gt;
&lt;BR /&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 28 Feb 2010 05:38:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883631#M76052</guid>
      <dc:creator>yamajun2</dc:creator>
      <dc:date>2010-02-28T05:38:20Z</dc:date>
    </item>
    <item>
      <title>x64 object runs much slower than win32 object when do loop has</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883632#M76053</link>
      <description>&lt;P&gt;None of these loops do anything. The compiler probably optimizes some away and not others - it is not a useful test program.&lt;/P&gt;
&lt;P&gt;Try again with a more realistic program - remembering that the optimizer is smarter than you might think.&lt;/P&gt;</description>
      <pubDate>Sun, 28 Feb 2010 23:46:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883632#M76053</guid>
      <dc:creator>Steven_L_Intel1</dc:creator>
      <dc:date>2010-02-28T23:46:42Z</dc:date>
    </item>
    <item>
      <title>x64 object runs much slower than win32 object when do loop has</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883633#M76054</link>
      <description>&lt;P&gt;Just want to add my experience. Have you enabled loop unrolling or changed the threshold for auto-parallelization? Loop unrolling might slow down some specific segments of code when used with O3. In other words, the results of aggressive optimization are sometimes code-specific.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Mar 2010 14:03:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883633#M76054</guid>
      <dc:creator>rasa</dc:creator>
      <dc:date>2010-03-01T14:03:51Z</dc:date>
    </item>
    <item>
      <title>x64 object runs much slower than win32 object when do loop has</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883634#M76055</link>
      <description>&lt;P&gt;Steve,&lt;/P&gt;
&lt;P&gt;Yes, these loops do nothing, and I know the optimizer is quite clever.&lt;/P&gt;
&lt;P&gt;I ran into this phenomenon in a more realistic program, which I attached to my first post.&lt;/P&gt;
&lt;P&gt;Anyway, I found the cause: the vectorizer. (Thanks ragu, you gave me a hint.) In x64 RELEASE, the DO loop with a variable increment parameter is not vectorized.&lt;/P&gt;
&lt;P&gt;There used to be informational messages when loops were vectorized, so I had assumed there was neither vectorization nor parallelization.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Here is another simple example.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;[fortran]PROGRAM x64_Release_runs_slow
  IMPLICIT NONE

  INTEGER :: i, j, k
  REAL :: s, t0, t1

  k = 1
  CALL CPU_TIME(t0)
  DO i = 1, 10**5
   s = 0.0
   DO j = 0, 10**4, k
     s = s + REAL(j)
   END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s 


  CALL CPU_TIME(t0)
  DO i = 1, 10**5
   s = 0.0
   DO j = 0, 10**4
     s = s + REAL(j)
   END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s 
 

  STOP
END PROGRAM x64_Release_runs_slow[/fortran]&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Win32 compiler vectorizer message&lt;/P&gt;
&lt;PRE&gt;[plain]1&amp;gt;C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1&amp;gt;C:FortransortConsole1Console1.f90(11): (col. 4) remark: LOOP WAS VECTORIZED.
1&amp;gt;C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1&amp;gt;C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.
[/plain]&lt;/PRE&gt;
&lt;BR /&gt;
&lt;P&gt;win32 output&lt;/P&gt;
&lt;PRE&gt;[plain] time =  0.2340015     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07
 time =  0.1716011     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07[/plain]&lt;/PRE&gt;
&lt;BR /&gt;
&lt;P&gt;x64 compiler vectorizer message&lt;/P&gt;
&lt;PRE&gt;[plain]1&amp;gt;C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1&amp;gt;C:FortransortConsole1Console1.f90(11): (col. 4) remark: loop was not vectorized: existence of vector dependence.
1&amp;gt;C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1&amp;gt;C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.
[/plain]&lt;/PRE&gt;
&lt;P&gt;x64 output&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;[plain] time =   1.201208     n(n+1)/2=  5.0005000E+07 calc.  5.0002896E+07
 time =  0.1872011     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07[/plain]&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;In the x64 build, the first loop is slow and its sum (5.0002896E+07) differs from the expected 5.0005000E+07.&lt;/P&gt;
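&lt;P&gt;One possible explanation for the different sums, offered as an assumption rather than something confirmed in this thread: single precision cannot represent every integer above 2**24 = 16777216, so a strictly sequential sum rounds once the running total passes that value, while a vectorized loop keeps several smaller partial sums that round differently. The sketch below compares the two summation orders against a double precision reference. The names and the choice of 4 lanes are hypothetical.&lt;/P&gt;
&lt;PRE&gt;[fortran]MODULE sum_order_demo
  IMPLICIT NONE
CONTAINS
  ! Strictly sequential single precision sum of 0..n
  FUNCTION sum_sequential(n) RESULT(s)
    INTEGER, INTENT(IN) :: n
    REAL :: s
    INTEGER :: j
    s = 0.0
    DO j = 0, n
      s = s + REAL(j)
    END DO
  END FUNCTION sum_sequential

  ! Same sum accumulated in 4 independent partial sums, roughly as a
  ! 4-wide vectorized loop might do (an assumption about the generated code)
  FUNCTION sum_lanes(n) RESULT(s)
    INTEGER, INTENT(IN) :: n
    REAL :: s
    REAL :: lane(0:3)
    INTEGER :: j
    lane = 0.0
    DO j = 0, n
      lane(MOD(j, 4)) = lane(MOD(j, 4)) + REAL(j)
    END DO
    s = (lane(0) + lane(1)) + (lane(2) + lane(3))
  END FUNCTION sum_lanes
END MODULE sum_order_demo

PROGRAM compare_sums
  USE sum_order_demo
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 10**4
  DOUBLE PRECISION :: exact

  exact = 0.5D0 * DBLE(n) * DBLE(n + 1)   ! n(n+1)/2 in double precision
  PRINT *, 'sequential =', sum_sequential(n)
  PRINT *, '4 lanes    =', sum_lanes(n)
  PRINT *, 'exact      =', exact
END PROGRAM compare_sums[/fortran]&lt;/PRE&gt;
&lt;P&gt;For n = 10**4 each of the four partial sums stays below 2**24, so the lane version accumulates without rounding, whereas the sequential running total exceeds 2**24 well before the loop ends. Under this reading, the slow loop and the fast loop would differ in accuracy only because of summation order.&lt;/P&gt;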
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 01 Mar 2010 18:00:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883634#M76055</guid>
      <dc:creator>yamajun2</dc:creator>
      <dc:date>2010-03-01T18:00:00Z</dc:date>
    </item>
    <item>
      <title>x64 object runs much slower than win32 object when do loop has</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883635#M76056</link>
      <description>&lt;P&gt;I agree, it looks like x64 compilation does not recognize&lt;/P&gt;
&lt;P&gt;k=1&lt;BR /&gt;DO i=...&lt;BR /&gt;DO j=s,e,k&lt;/P&gt;
&lt;P&gt;as having k as "fixed at 1"&lt;/P&gt;
&lt;P&gt;It looks like an area where the optimizer missed an opportunity.&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;
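&lt;P&gt;For reference, the Fortran standard defines the iteration count of DO var = m1, m2, m3 as MAX(INT((m2 - m1 + m3) / m3), 0). Unless the compiler can prove the value of the increment, this count has to be evaluated at run time, whereas a constant increment of 1 lets it be simplified. A minimal sketch follows; the names trip_count_demo and trip_count are hypothetical and not from this thread.&lt;/P&gt;
&lt;PRE&gt;[fortran]MODULE trip_count_demo
  IMPLICIT NONE
CONTAINS
  ! Iteration count of "DO var = m1, m2, m3" as defined by the
  ! Fortran standard (m3 must be nonzero)
  PURE FUNCTION trip_count(m1, m2, m3) RESULT(n)
    INTEGER, INTENT(IN) :: m1, m2, m3
    INTEGER :: n
    n = MAX((m2 - m1 + m3) / m3, 0)
  END FUNCTION trip_count
END MODULE trip_count_demo

PROGRAM show_trip_count
  USE trip_count_demo
  IMPLICIT NONE
  INTEGER :: k, n

  k = 1
  n = trip_count(1, 2 * 10**9, k)
  ! After the loop, the index holds m1 + n * m3
  PRINT *, n, 1 + n * k
END PROGRAM show_trip_count[/fortran]&lt;/PRE&gt;
&lt;P&gt;For the loop bounds used earlier in the thread this gives 2000000000 iterations and a final index of 2000000001, which matches the index value printed in the timing output.&lt;/P&gt;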
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 01 Mar 2010 21:55:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/x64-object-runs-much-slower-than-win32-object-when-do-loop-has/m-p/883635#M76056</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2010-03-01T21:55:39Z</dc:date>
    </item>
  </channel>
</rss>

