Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Slow allocatable arrays

brovchik
Beginner

Dear all, I have run into a problem: my program works with large allocatable arrays much more slowly than with static arrays. Below is simple code that initializes a large array. With static arrays this code runs 10 times faster.
I'm using Intel Fortran Compiler 10.0 under Windows with 2 GB RAM.

Does anybody know what the reason is, and what I can do to make allocatable arrays work faster?

!integer, parameter :: NP=10000000
integer NP
real, allocatable :: X(:),Y(:)
!real X(10000000),Y(10000000)
integer i,k,ist,iend,icountrate

NP = 10000000
allocate(X(NP),Y(NP))

do k = 1, 100
   do i = 1, NP
      X(i) = 0.
      Y(i) = 0.
   enddo
enddo

27 Replies
TimP
Honored Contributor III

Did you check whether your outer loop is being executed in both cases? For years, many people have preferred compilers that can shortcut repetitions such as this. A web search will turn up 20-year-old examples of the precautions taken in artificial benchmarks to prevent a compiler from optimizing away extra loops.
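
One such precaution can be sketched in Fortran as follows. This is only an illustration, not the original poster's test: the program name, the k-dependent values, and the printed checksum are assumptions added here. Each pass stores a different value and the result is used afterwards, which gives the compiler less room to drop the repetitions. An aggressive optimizer may still collapse some of the work, so inspecting the generated code remains the reliable check.

```fortran
program keep_the_loops
  implicit none
  integer, parameter :: np = 10000000
  real, allocatable :: x(:), y(:)
  integer :: i, k, ist, iend, icountrate

  allocate(x(np), y(np))

  call system_clock(ist, icountrate)
  do k = 1, 100
    do i = 1, np
      ! The stored values depend on k, so the 100 passes are not identical
      x(i) = real(k)
      y(i) = real(k) + 1.0
    end do
  end do
  call system_clock(iend)

  ! Using the arrays after the loop gives the compiler a reason to keep the stores
  print *, 'checksum:', x(1) + y(np)
  print *, 'elapsed (s):', real(iend - ist) / real(icountrate)
end program keep_the_loops
```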

Steve_Nuchia
New Contributor I
That example is short enough that you should be able to look at the generated assembly and see the differences. If you can't read assembly, try posting both versions here.
Set a breakpoint in the loop and debug to it; right-click and go to Disassembly. Copy the surrounding several dozen lines to the clipboard and paste them in.
You don't say what kind of processor you're using. An unfortunate cache stride or some kind of prefetch disagreement seems unlikely with such a simple example, but it's a remote possibility.
Your arrays are large enough that the performance of this code will be determined by how well the cache is managed. Optimal code will be right at 10x better than bad code on current x64 hardware (using write-combining, no-fetch sequences versus actually updating individual cells one by one).
Other remote possibilities include alignment problems, NaNs in the heap, and even more exotic stuff I can't imagine right now.
Run both under VTune and see what's going on.
Steve_Nuchia
New Contributor I
If the outer loop is being optimized away in both cases, the most likely explanation is that you are counting the heap allocation overhead in one case and not counting the loader's allocation of the static arrays in the other. How are you measuring elapsed time? What are the elapsed times, and what is your hardware?
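
A rough way to check whether allocation is being counted is to time the allocate statement and the loop separately. The sketch below is an assumption about how such a test could look, not the original poster's code; the program and variable names are made up for illustration. Note that on many systems part of the cost of fresh memory only shows up when the pages are first written, so the loop timing may still include some of it.

```fortran
program time_allocation_separately
  implicit none
  integer, parameter :: np = 10000000
  real, allocatable :: x(:), y(:)
  integer :: i, t0, t1, t2, rate

  call system_clock(t0, rate)
  allocate(x(np), y(np))          ! heap allocation, timed on its own
  call system_clock(t1)

  do i = 1, np                    ! initialization loop, timed on its own
    x(i) = 0.
    y(i) = 0.
  end do
  call system_clock(t2)

  print *, 'allocate (s):', real(t1 - t0) / real(rate)
  print *, 'loop     (s):', real(t2 - t1) / real(rate)
  ! Use the data so the loop has an observable effect
  print *, 'check:', x(1) + y(np)
end program time_allocation_separately
```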

brovchik
Beginner
Thanks a lot for the answer! I'm using an Intel Core 2 CPU at 2.4 GHz. I measure time with the system_clock(ist,icountrate) routine, but the difference is also plainly visible by eye. In the Release configuration I get 4.04 s with allocatable arrays and 0.42 s with static arrays. After you mentioned debugging, I tested again in the Debug configuration and got 9.18 s and 5.8 s. Not such an extreme difference, but still significant.

All compiler options were at their defaults. None of the option changes I could think of helped. I also tried the /heap-arrays[:size] option with different values, without success. I have a feeling that something is wrong with memory access, but I cannot work out what to do. Task Manager reports 80 MB of memory usage with allocatable arrays (which is normal, because I have 2*10^7 real*4 values) and only 1.6 MB with static arrays.

brovchik
Beginner
I cannot read disassembly, so I am posting it here. I have also heard (I do not remember where) that for allocatable arrays the program performs some additional checks on memory storage addresses. Maybe that can be seen in the disassembly.

So, the allocatable variant:

implicit none

!integer, parameter :: NP=10000000
integer NP
real, allocatable :: X(:),Y(:)
!real X(10000000),Y(10000000)
integer i,j,k,ist,iend,icountrate


NP = 10000000;
0040101F mov dword ptr [ebp-10h],989680h
allocate(X(NP),Y(NP))
00401026 mov eax,dword ptr [X+0Ch (4D200Ch)]
0040102B or eax,9000000h
00401030 mov dword ptr [X+0Ch (4D200Ch)],eax
00401035 mov dword ptr [X+4 (4D2004h)],4
0040103F mov eax,1
00401044 mov dword ptr [X+10h (4D2010h)],eax
00401049 mov dword ptr [X+20h (4D2020h)],eax
0040104E mov eax,dword ptr [NP]
00401051 test eax,eax
00401053 jg TEST+5Eh (40105Eh)
00401055 mov dword ptr [ebp-34h],0
0040105C jmp TEST+64h (401064h)
0040105E mov eax,dword ptr [NP]
00401061 mov dword ptr [ebp-34h],eax
00401064 mov eax,dword ptr [ebp-34h]
00401067 mov dword ptr [X+18h (4D2018h)],eax
0040106C mov edx,4
00401071 mov dword ptr [X+1Ch (4D201Ch)],edx
00401077 add esp,0FFFFFFF0h
0040107A lea ecx,[ebp-48h]
0040107D mov dword ptr [esp],ecx
00401080 mov dword ptr [esp+4],2
00401088 mov dword ptr [esp+8],eax
0040108C mov dword ptr [esp+0Ch],edx
00401090 call _for_check_mult_overflow (40252Ch)
00401095 add esp,10h
00401098 mov dword ptr [ebp-30h],eax
0040109B add esp,0FFFFFFF4h
0040109E mov eax,dword ptr [ebp-48h]
004010A1 mov dword ptr [esp],eax
004010A4 mov dword ptr [esp+4],offset X (4D2000h)
004010AC mov eax,dword ptr [X+0Ch (4D200Ch)]
004010B1 and eax,1
004010B4 add eax,eax
004010B6 and eax,0FFFFFFEFh
004010B9 mov edx,dword ptr [ebp-30h]
004010BC and edx,1
004010BF shl edx,4
004010C2 or eax,edx
004010C4 mov dword ptr [esp+8],eax
004010C8 call _for_alloc_allocatable (402878h)
004010CD add esp,0Ch
004010D0 mov dword ptr [X+0Ch (4D200Ch)],5
004010DA mov eax,dword ptr [X+20h (4D2020h)]
004010DF shl eax,2
004010E2 neg eax
004010E4 mov dword ptr [X+8 (4D2008h)],eax
004010E9 mov eax,dword ptr [Y+0Ch (4D2034h)]
004010EE or eax,9000000h
004010F3 mov dword ptr [Y+0Ch (4D2034h)],eax
004010F8 mov dword ptr [Y+4 (4D202Ch)],4
00401102 mov eax,1
00401107 mov dword ptr [Y+10h (4D2038h)],eax
0040110C mov dword ptr [Y+20h (4D2048h)],eax
00401111 mov eax,dword ptr [NP]
00401114 test eax,eax
00401116 jg TEST+121h (401121h)
00401118 mov dword ptr [ebp-2Ch],0
0040111F jmp TEST+127h (401127h)
00401121 mov eax,dword ptr [NP]
00401124 mov dword ptr [ebp-2Ch],eax
00401127 mov eax,dword ptr [ebp-2Ch]
0040112A mov dword ptr [Y+18h (4D2040h)],eax
0040112F mov edx,4
00401134 mov dword ptr [Y+1Ch (4D2044h)],edx
0040113A add esp,0FFFFFFF0h
0040113D lea ecx,[ebp-44h]
00401140 mov dword ptr [esp],ecx
00401143 mov dword ptr [esp+4],2
0040114B mov dword ptr [esp+8],eax
0040114F mov dword ptr [esp+0Ch],edx
00401153 call _for_check_mult_overflow (40252Ch)
00401158 add esp,10h
0040115B mov dword ptr [ebp-28h],eax
0040115E add esp,0FFFFFFF4h
00401161 mov eax,dword ptr [ebp-44h]
00401164 mov dword ptr [esp],eax
00401167 mov dword ptr [esp+4],offset Y (4D2028h)
0040116F mov eax,dword ptr [Y+0Ch (4D2034h)]
00401174 and eax,1
00401177 add eax,eax
00401179 and eax,0FFFFFFEFh
0040117C mov edx,dword ptr [ebp-28h]
0040117F and edx,1
00401182 shl edx,4
00401185 or eax,edx
00401187 mov dword ptr [esp+8],eax
0040118B call _for_alloc_allocatable (402878h)
00401190 add esp,0Ch
00401193 mov dword ptr [Y+0Ch (4D2034h)],5
0040119D mov eax,dword ptr [Y+20h (4D2048h)]
004011A2 shl eax,2
004011A5 neg eax
004011A7 mov dword ptr [Y+8 (4D2030h)],eax

call system_clock(ist,icountrate)
004011AC push edi
004011AD mov dword ptr [esp],4
004011B4 call _for_system_clock_count (402AF0h)
004011B9 pop ecx
004011BA mov dword ptr [ebp-24h],eax
004011BD mov eax,dword ptr [ebp-24h]
004011C0 mov dword ptr [IST],eax
004011C3 push edi
004011C4 mov dword ptr [esp],4
004011CB call _for_system_clock_rate (402AB0h)
004011D0 pop ecx
004011D1 mov dword ptr [ebp-20h],eax
004011D4 mov eax,dword ptr [ebp-20h]
004011D7 mov dword ptr [ICOUNTRATE],eax


do k = 1, 100
004011DA mov dword ptr [K],1

do i = 1, NP
004011E1 mov eax,dword ptr [NP]
004011E4 mov dword ptr [ebp-8],eax
004011E7 mov dword ptr [I],1
004011EE mov eax,dword ptr [ebp-8]
004011F1 test eax,eax
004011F3 jle TEST+3C4h (4013C4h)
X(i) = 1.
004011F9 mov eax,dword ptr [I]
004011FC mov edx,dword ptr [X+20h (4D2020h)]
00401202 cmp eax,edx
00401204 jge TEST+250h (401250h)
00401206 add esp,0FFFFFFE0h
00401209 mov dword ptr [esp],10100003h
00401210 mov dword ptr [esp+4],offset ___xt_z+58h (4B62A0h)
00401218 mov dword ptr [esp+8],5
00401220 mov dword ptr [esp+0Ch],3
00401228 mov dword ptr [esp+10h],1
00401230 mov dword ptr [esp+14h],offset ___xt_z+144h (4B638Ch)
00401238 mov eax,dword ptr [I]
0040123B mov dword ptr [esp+18h],eax
0040123F mov eax,dword ptr [X+20h (4D2020h)]
00401244 mov dword ptr [esp+1Ch],eax
00401248 call _for_emit_diagnostic (403224h)
0040124D add esp,20h
00401250 mov eax,dword ptr [X+20h (4D2020h)]
00401255 mov edx,dword ptr [X+18h (4D2018h)]
0040125B lea eax,[eax+edx-1]
0040125F mov edx,dword ptr [I]
00401262 cmp edx,eax
00401264 jle TEST+2BAh (4012BAh)
00401266 add esp,0FFFFFFE0h
00401269 mov dword ptr [esp],10100002h
00401270 mov dword ptr [esp+4],offset ___xt_z+0D8h (4B6320h)
00401278 mov dword ptr [esp+8],5
00401280 mov dword ptr [esp+0Ch],2
00401288 mov dword ptr [esp+10h],1
00401290 mov dword ptr [esp+14h],offset ___xt_z+148h (4B6390h)
00401298 mov eax,dword ptr [I]
0040129B mov dword ptr [esp+18h],eax
0040129F mov eax,dword ptr [X+20h (4D2020h)]
004012A4 mov edx,dword ptr [X+18h (4D2018h)]
004012AA lea eax,[eax+edx-1]
004012AE mov dword ptr [esp+1Ch],eax
004012B2 call _for_emit_diagnostic (403224h)
004012B7 add esp,20h
004012BA fld1
004012BC mov eax,dword ptr [I]
004012BF mov edx,dword ptr [X (4D2000h)]
004012C5 lea eax,[edx+eax*4]
004012C8 mov edx,dword ptr [X+20h (4D2020h)]
004012CE shl edx,2
004012D1 neg edx
004012D3 fstp dword ptr [edx+eax]

And the static variant:

implicit none

integer, parameter :: NP=10000000
!integer NP
!real, allocatable :: X(:),Y(:)
real X(10000000),Y(10000000)
integer i,j,k,ist,iend,icountrate


!NP = 10000000;
!allocate(X(NP),Y(NP))

call system_clock(ist,icountrate)
0040101F push edi
00401020 mov dword ptr [esp],4
00401027 call _for_system_clock_count (402280h)
0040102C pop ecx
0040102D mov dword ptr [ebp-1Ch],eax
00401030 mov eax,dword ptr [ebp-1Ch]
00401033 mov dword ptr [IST],eax
00401036 push edi
00401037 mov dword ptr [esp],4
0040103E call _for_system_clock_rate (402240h)
00401043 pop ecx
00401044 mov dword ptr [ebp-18h],eax
00401047 mov eax,dword ptr [ebp-18h]
0040104A mov dword ptr [ICOUNTRATE],eax


do k = 1, 100
0040104D mov dword ptr [K],1

do i = 1, NP
00401054 mov dword ptr [I],1
X(i) = 1.
0040105B mov eax,dword ptr [I]
0040105E test eax,eax
00401060 jg TEST+0A8h (4010A8h)
00401062 add esp,0FFFFFFE0h
00401065 mov dword ptr [esp],10100003h
0040106C mov dword ptr [esp+4],offset ___xt_z+58h (4B62A0h)
00401074 mov dword ptr [esp+8],5
0040107C mov dword ptr [esp+0Ch],3
00401084 mov eax,1
00401089 mov dword ptr [esp+10h],eax
0040108D mov dword ptr [esp+14h],offset ___xt_z+144h (4B638Ch)
00401095 mov edx,dword ptr [I]
00401098 mov dword ptr [esp+18h],edx
0040109C mov dword ptr [esp+1Ch],eax
004010A0 call _for_emit_diagnostic (4029B4h)
004010A5 add esp,20h
004010A8 mov eax,dword ptr [I]
004010AB cmp eax,989680h
004010B0 jle TEST+0FBh (4010FBh)
004010B2 add esp,0FFFFFFE0h
004010B5 mov dword ptr [esp],10100002h
004010BC mov dword ptr [esp+4],offset ___xt_z+0D8h (4B6320h)
004010C4 mov dword ptr [esp+8],5
004010CC mov dword ptr [esp+0Ch],2
004010D4 mov dword ptr [esp+10h],1
004010DC mov dword ptr [esp+14h],offset ___xt_z+148h (4B6390h)
004010E4 mov eax,dword ptr [I]
004010E7 mov dword ptr [esp+18h],eax
004010EB mov dword ptr [esp+1Ch],989680h
004010F3 call _for_emit_diagnostic (4029B4h)
004010F8 add esp,20h
004010FB fld1
004010FD mov eax,dword ptr [I]
00401100 fstp dword ptr TWO_TO_M1536A+8 (4D577Ch)[eax*4]
Y(i) = 1.

Steven_L_Intel1
Employee

You have array bounds checking on - that's most of the code. Try turning it off. If you don't realize you have it on, then you may be building a Debug configuration - use a Release configuration.

With an allocatable array, the array checking code has to fetch the bounds from the array descriptor each time (well, it may not HAVE to but it does), but with a static array it knows the bounds at compile time.
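
The difference can be pictured with hand-written checks. The sketch below is only an illustration of the idea, not the code the compiler actually generates: the explicit `if` tests, the program name, and the array names are assumptions. For the allocatable array the limits come from `lbound`/`ubound`, which are read from the descriptor at run time; for the static array the limits are constants known when the code is compiled.

```fortran
program bounds_check_sketch
  implicit none
  integer, parameter :: n = 1000
  real, allocatable :: x(:)      ! bounds live in a run-time descriptor
  real :: s(n)                   ! bounds are compile-time constants
  integer :: i

  allocate(x(n))

  do i = 1, n
    ! Allocatable case: both limits are looked up at run time
    if (i < lbound(x, 1) .or. i > ubound(x, 1)) stop 'X index out of bounds'
    x(i) = 1.0

    ! Static case: the limits 1 and n are known when the code is compiled
    if (i < 1 .or. i > n) stop 'S index out of bounds'
    s(i) = 1.0
  end do

  print *, x(n), s(n)
end program bounds_check_sketch
```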

abhimodak
New Contributor I

Hi Steve

Attached are the source file and an Excel results file.

I am using Win64 XP Professional with Visual Studio 2005. I have a Xeon 5150 (2.66 GHz) with 6 GB of RAM.

I ran the test with two compilers, 11.0.039 beta and 10.1.024, with Win32 and x64 builds, in the default "Release" configuration. I did not activate any additional checks.

What I find is that the computation times start to differ when NP = 1000000 or higher. When NP is read from user input, this starts to happen at an NP one order of magnitude smaller.

However, what surprises me is the difference between using X(1:NP) and just X.

I really hope that there is no silly error in my program. But I am wondering what is going on.

Sincerely

Abhi

abhimodak
New Contributor I

Although I used "Add Files", I don't see them in my post, so I am pasting my code and the test results here:

Results (NP = 10000000, CPU time in seconds):

Compiler 11.0.039 Beta
          Win32      x64
Loop      0.125      0.125
WithDim   0.09375    0.082031
NOTDim    8.753906   4.375

Compiler 10.1.024
          Win32      x64
Loop      0.125      0.109375
WithDim   0.09375    0.109375
NOTDim    8.53125    4.28125

Source

Program Test_AllocationSpeed
!
! Purpose: Test Speed difference when using allocatable arrays.
!
        Implicit None
!
        Integer :: NP
        Real(8), Allocatable :: X(:), Y(:)
!
!        Integer, Parameter :: NP = 10000000
!        Real(8) :: X(NP), Y(NP)
!
        Integer :: ial, i, k
        Character(32) :: AllocationError

        Real(8) :: ts, te
!
!###################
!
!        Print *, "Give NP"
!        Read *, NP
!
        NP = 10000000

        Allocate(X(NP), Y(NP), stat=ial)!, ERRMSG = AllocationError)
        if (ial /= 0) then
            Stop
            !Write(*,"(A)") Trim(AllocationError)
        endif

        ! With Loop
        Call CPU_Time(ts)
        do k = 1, 100
            do i = 1, NP
                X(i) = 0.0d0
                Y(i) = 0.0d0
            enddo
        end do
        Call CPU_Time(te)
        Write(*,"(A)") "With Loop:"
        Write(*,"(A,ES14.6)") "Computation time with Loop :", (te-ts)
        Write(*,*)

        ! With whole array dimensioned
        Call CPU_Time(ts)
        do k = 1, 100
            X(1:NP) = 0.0d0
            Y(1:NP) = 0.0d0
        end do
        Call CPU_Time(te)
        Write(*,"(A)") "With whole array dimensioned:"
        Write(*,"(A,ES14.6)") "Computation time :", (te-ts)
        Write(*,*)

        ! With whole array NOT dimensioned
        Call CPU_Time(ts)
        do k = 1, 100
            X = 0.0d0
            Y = 0.0d0
        end do
        Call CPU_Time(te)
        Write(*,"(A)") "With whole array NOT dimensioned:"
        Write(*,"(A,ES14.6)") "Computation time :", (te-ts)
        Write(*,*)

!
End Program Test_AllocationSpeed
!
!===============================================================================
!><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><
!===============================================================================
!

brovchik
Beginner

With your test I get the following times on an Intel Core 2 Duo 2.4 GHz, Win32, with the 10.0 compiler:

           Allocatable   Static
loop       8.04          7.875
with DIM   7.546         7.546875
NOT DIM    7.531         7.546875

I'm very confused, because after I ran your test as a separate project, my own test with allocatable and static arrays began to show almost equal times. This is very strange to me, because I did not change anything! The only idea I have is that some external program can influence the memory access of the tested program. Maybe something similar is happening in your case. Your case is very interesting, and your times for "loop" and "dim" look very optimistic. I would be very interested to know whether such a significant speedup is an error or whether it can be achieved somehow.

Steven_L_Intel1
Employee

Show us the ifort command line used - it will be in the build log, a link to which will be displayed after you build the project. My guess is that you used an optimization level that removed one or both loops completely.

abhimodak
New Contributor I

Hi

I had actually meant to include the timings with compiler version 10.0.025. Here they are:

Compiler 10.0.025
          Win32      x64
Loop      8.613281   4.4375
WithDim   8.316406   4.292969
NOTDim    8.425781   8.382813

Below are the Win32 and x64 build logs with 10.0.025

Abhi

================

[1] Win32

Deleting intermediate files and output files for project 'Test_AllocateSpeed', configuration 'Release|Win32'.
Compiling with Intel Fortran Compiler 10.0.025 [IA-32]...
ifort /nologo /module:"Release" /object:"Release" /libs:static /threads /c /Qvc8 /Qlocation,link,"C:Program Files (x86)Microsoft Visual Studio 8VCbin" "C:AbhiMySourceTestsTest_AllocateSpeedTest_AllocateSpeed.f90"
Linking...
Link /OUT:"ReleaseTest_AllocateSpeed.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"C:AbhiMySourceTestsTest_AllocateSpeedreleasetest_allocatespeed.exe.intermediate.manifest" /SUBSYSTEM:CONSOLE /IMPLIB:"C:AbhiMySourceTestsTest_AllocateSpeedreleasetest_allocatespeed.lib" "ReleaseTest_AllocateSpeed.obj"
Link: executing 'link'

Embedding manifest...
mt.exe /nologo /outputresource:"C:AbhiMySourceTestsTest_AllocateSpeedreleasetest_allocatespeed.exe;#1" /manifest "C:AbhiMySourceTestsTest_AllocateSpeedreleasetest_allocatespeed.exe.intermediate.manifest"

Test_AllocateSpeed - 0 error(s), 0 warning(s)

[2] x64

Deleting intermediate files and output files for project 'Test_AllocateSpeed', configuration 'Release|x64'.
Compiling with Intel Fortran Compiler 10.0.025 [Intel 64]...
ifort /nologo /module:"x64Release" /object:"x64Release" /libs:static /threads /c /Qvc8 /Qlocation,link,"C:Program Files (x86)Microsoft Visual Studio 8VCbinx86_amd64" "C:AbhiMySourceTestsTest_AllocateSpeedTest_AllocateSpeed.f90"
C:AbhiMySourceTestsTest_AllocateSpeedTest_AllocateSpeed.f90(34): (col. 13) remark: LOOP WAS VECTORIZED.
C:AbhiMySourceTestsTest_AllocateSpeedTest_AllocateSpeed.f90(47): (col. 13) remark: LOOP WAS VECTORIZED.
C:AbhiMySourceTestsTest_AllocateSpeedTest_AllocateSpeed.f90(48): (col. 13) remark: LOOP WAS VECTORIZED.
C:AbhiMySourceTestsTest_AllocateSpeedTest_AllocateSpeed.f90(58): (col. 13) remark: LOOP WAS VECTORIZED.
C:AbhiMySourceTestsTest_AllocateSpeedTest_AllocateSpeed.f90(59): (col. 13) remark: LOOP WAS VECTORIZED.

Linking...
Link /OUT:"x64ReleaseTest_AllocateSpeed.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"C:AbhiMySourceTestsTest_AllocateSpeedx64releasetest_allocatespeed.exe.intermediate.manifest" /SUBSYSTEM:CONSOLE /IMPLIB:"C:AbhiMySourceTestsTest_AllocateSpeedx64releasetest_allocatespeed.lib" "x64ReleaseTest_AllocateSpeed.obj"
Link: executing 'link'

Embedding manifest...
mt.exe /nologo /outputresource:"C:AbhiMySourceTestsTest_AllocateSpeedx64releasetest_allocatespeed.exe;#1" /manifest "C:AbhiMySourceTestsTest_AllocateSpeedx64releasetest_allocatespeed.exe.intermediate.manifest"

Test_AllocateSpeed - 0 error(s), 0 warning(s)

Steve_Nuchia
New Contributor I
160 megabytes of data * 100 passes = 16e9 bytes to write. At 8 seconds you're seeing only 2 GB/sec, which is about half of what your hardware might be capable of if it has a really good RAM configuration. Anything less than 4 seconds means some of the code is optimized away.
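
The same arithmetic can be written out as a small Fortran program. The array sizes and pass count come from the posted test (two Real(8) arrays of 10,000,000 elements, 100 passes); the three elapsed times are hypothetical round numbers chosen to resemble the reported ones, and the program name is made up for illustration.

```fortran
program write_bandwidth
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Figures from the posted test
  real(dp), parameter :: n_elements = 1.0e7_dp
  real(dp), parameter :: bytes_per_element = 8.0_dp
  real(dp), parameter :: n_arrays = 2.0_dp
  real(dp), parameter :: n_passes = 100.0_dp
  ! Hypothetical elapsed times in seconds (assumed, for illustration only)
  real(dp), parameter :: elapsed(3) = [8.0_dp, 4.0_dp, 0.125_dp]
  real(dp) :: total_bytes
  integer :: j

  ! 1.0e7 elements * 8 bytes * 2 arrays * 100 passes = 1.6e10 bytes
  total_bytes = n_elements * bytes_per_element * n_arrays * n_passes

  do j = 1, size(elapsed)
    print '(A,F8.3,A,F10.2,A)', 'elapsed ', elapsed(j), ' s -> ', &
          total_bytes / elapsed(j) / 1.0e9_dp, ' GB/s'
  end do
end program write_bandwidth
```

A result far above what the memory system can sustain is a sign that the timed code did not actually write all of the data.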

abhimodak
New Contributor I

Honestly, I am not sure if I follow the last two posts....

I have posted the timings with three versions of the compiler for Win32 and x64. I don't understand why they are the way they are... What really bothers me is the performance of whole-array operations.

Earlier Steve suggested that /check:bounds might be on. But that is not the case, since I use the Release configuration for all these runs.

I am using a Xeon 5150 (2.66 GHz) with 6 GB of RAM. The operating system is WinXP 64 Professional.

Abhi

Steven_L_Intel1
Employee

I see what looks like bounds checking code in the assembly listing you posted.

abhimodak
New Contributor I

Hi Steve

So even when I am NOT using /check:bounds it is getting in there? How come it is not affecting the loops? I am very confused.. Should I be using the whole array (without dimension) syntax or not?

Abhi

Steven_L_Intel1
Employee

You are showing the assembly output from a debug configuration.

abhimodak
New Contributor I

Hi Steve

I did NOT post any assembly code... I only posted the build logs. The assembly code was posted by brovchik. I am running only in "release" mode, and all the computation times I reported are from release mode.

Abhi

Steven_L_Intel1
Employee

Abhi,

Sorry, I did not notice that.
