Hi,
I have run into some strange behavior.
When a program contains both
1. a DO loop whose increment parameter is a variable, and
2. a reference to the DO loop index after the loop,
the executable compiled in x64 RELEASE mode runs much slower than the one compiled in win32 RELEASE mode.
Here is a minimal sample program.
[fortran]
PROGRAM x64_Release_runs_slow
  IMPLICIT NONE
  INTEGER :: i, j, k
  REAL :: t0, t1
  CALL CPU_TIME(t0)
  !
  k = 1
  DO i = 1, 10**8
    DO j = 1, 10**2, k   ! 1. use variable k as an increment parameter
      !
    END DO
  END DO
  PRINT *, j             ! 2. reference to the loop index j
  !
  CALL CPU_TIME(t1)
  PRINT *, t1 - t0
  STOP
END PROGRAM x64_Release_runs_slow
[/fortran]
This sample takes a few seconds in x64 RELEASE mode, but practically 0 seconds in win32 RELEASE mode.
The slowdown disappears when k is replaced by the constant 1 or when the line "PRINT *, j" is commented out.
In DEBUG mode, the x64 and win32 builds take almost the same CPU time.
I suppose this might be an optimization problem.
I attach the more realistic program in which I encountered this problem. (The option /assume:realloc_lhs is required.) In this case the x64 version is ~40% slower than the win32 version.
Yamajun
Nobody interested?
Here is a simpler example. It contains 4 semantically identical DO loops, and only the second one runs slowly.
Result: x64 RELEASE
[plain]
 2000000001  0.0000000E+00
 2000000001  0.3588023
 2000000001  0.0000000E+00
 2000000001  0.0000000E+00
[/plain]
[fortran]
PROGRAM test
  IMPLICIT NONE
  INTEGER :: i, j, k
  INTEGER(8) :: jj
  REAL :: t0, t1
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9
    !
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
  !
  !
  k = 1
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, k
    !
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
  !
  !
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, 1
    !
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
  !
  !
  k = 1
  CALL CPU_TIME(t0)
  DO jj = 1, 2 * 10**9, k
    !
  END DO
  CALL CPU_TIME(t1)
  PRINT *, jj, t1 - t0
  !
  !
  STOP
END PROGRAM test
[/fortran]
None of these loops do anything. The compiler probably optimizes some away and not others - it is not a useful test program.
Try again with a more realistic program - remembering that the optimizer is smarter than you might think.
I just want to add my experience. Do you have loop unrolling enabled, or have you changed the threshold for auto-parallelization? Loop unrolling can slow down specific segments of code when used with O3. In other words, the results of aggressive optimization are sometimes code-specific.
Steve,
Yes, these loops do nothing, and I know the optimizer is quite clever.
I ran into this phenomenon in the more realistic program that I attached to my first post.
Anyway, I have found the reason: it is the vectorizer. (Thanks ragu, you gave me a hint.) In x64 RELEASE, a DO loop whose increment parameter is a variable is not vectorized.
There used to be info messages when loops were vectorized. I did not see any, so I had assumed that neither vectorization nor parallelization was taking place.
Here is another simple example.
[fortran]
PROGRAM x64_Release_runs_slow
  IMPLICIT NONE
  INTEGER :: i, j, k
  REAL :: s, t0, t1

  k = 1

  CALL CPU_TIME(t0)
  DO i = 1, 10**5
    s = 0.0
    DO j = 0, 10**4, k
      s = s + REAL(j)
    END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s


  CALL CPU_TIME(t0)
  DO i = 1, 10**5
    s = 0.0
    DO j = 0, 10**4
      s = s + REAL(j)
    END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s
  STOP
END PROGRAM x64_Release_runs_slow
[/fortran]
Win32 compiler vectorizer message
[plain]
1>C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(11): (col. 4) remark: LOOP WAS VECTORIZED.
1>C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.
[/plain]
win32 output
[plain]
 time = 0.2340015 n(n+1)/2= 5.0005000E+07 calc. 5.0005000E+07
 time = 0.1716011 n(n+1)/2= 5.0005000E+07 calc. 5.0005000E+07
[/plain]
x64 compiler vectorizer message
[bash]
1>C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(11): (col. 4) remark: loop was not vectorized: existence of vector dependence.
1>C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.
[/bash]
x64 output
[plain]
 time = 1.201208 n(n+1)/2= 5.0005000E+07 calc. 5.0002896E+07
 time = 0.1872011 n(n+1)/2= 5.0005000E+07 calc. 5.0005000E+07
[/plain]
In the x64 build, the first loop is slow and its sum is not correct (5.0002896E+07 instead of 5.0005000E+07).
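A note on the differing sums: one plausible explanation (an assumption, not something confirmed in this thread) is that the discrepancy comes from the order of summation. A single-precision variable holds integers exactly only up to 2**24 = 16777216, and a single running total passes that value partway through the loop, so later additions are rounded. A vectorized reduction typically keeps several partial totals, and for this range each of four such totals stays below 2**24. The sketch below imitates that with four accumulators. The program name, the variable names, and the choice of four lanes are illustrative only and are not taken from the compiler's actual code generation.

```fortran
PROGRAM sum_order_sketch
  USE, INTRINSIC :: iso_fortran_env, ONLY: real32, real64
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 10**4
  INTEGER :: j
  REAL(real32) :: s_seq, lanes(0:3)
  REAL(real64) :: s_ref

  ! Double-precision reference: every partial sum is an integer far
  ! below 2**53, so this total is exact.
  s_ref = 0.0_real64
  DO j = 0, n
    s_ref = s_ref + REAL(j, real64)
  END DO
  IF (s_ref /= 50005000.0_real64) ERROR STOP 'unexpected reference sum'

  ! One single-precision running total, as in a scalar loop.
  s_seq = 0.0_real32
  DO j = 0, n
    s_seq = s_seq + REAL(j, real32)
  END DO

  ! Four single-precision partial totals, combined at the end.
  ! This only imitates what a 4-wide vector reduction might do.
  lanes = 0.0_real32
  DO j = 0, n
    lanes(MOD(j, 4)) = lanes(MOD(j, 4)) + REAL(j, real32)
  END DO

  PRINT *, 'double reference :', s_ref
  PRINT *, 'one accumulator  :', s_seq
  PRINT *, 'four accumulators:', SUM(lanes)
END PROGRAM sum_order_sketch
```

What the one-accumulator line prints depends on the compiler and its floating-point settings, because an optimizer may itself reorder that loop. Comparing the three printed values with and without optimization is one way to check whether summation order accounts for the difference seen above.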
I agree, it looks like x64 compilation does not recognize
k=1
DO i=...
DO j=s,e,k
as having k "fixed at 1".
It looks like an area where the optimizer missed an opportunity.
Jim
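Since the thread reports that the loop is vectorized when the increment is the literal 1 (or omitted), one possible workaround is to test the increment at run time and branch to a loop with a literal unit stride. The following is a minimal sketch of that idea. The function name sum_with_stride and its arguments are hypothetical, and whether this actually restores vectorization with the compiler version discussed here is an untested assumption.

```fortran
PROGRAM stride_dispatch_sketch
  IMPLICIT NONE
  INTEGER :: k
  REAL :: s

  k = 1
  s = sum_with_stride(10**4, k)
  PRINT *, 'calc.', s

CONTAINS

  ! Hypothetical helper: sums REAL(j) for j = 0, n, step.
  ! step must be nonzero, as for any DO loop increment.
  FUNCTION sum_with_stride(n, step) RESULT(total)
    INTEGER, INTENT(IN) :: n, step
    REAL :: total
    INTEGER :: j

    total = 0.0
    IF (step == 1) THEN
      ! Literal unit stride: the form reported as vectorized in this thread.
      DO j = 0, n
        total = total + REAL(j)
      END DO
    ELSE
      ! General case: variable increment.
      DO j = 0, n, step
        total = total + REAL(j)
      END DO
    END IF
  END FUNCTION sum_with_stride

END PROGRAM stride_dispatch_sketch
```

The duplication costs a few lines per hot loop, so it only seems worthwhile where the vectorizer report shows the variable-increment form being rejected.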
