
Slow allocatable arrays

brovchik
Beginner

Dear all. I have run into a problem: my program works with large allocatable arrays much more slowly than with static arrays. Below is a simple code that initializes a large array. With static arrays this code runs 10 (!) times faster.
I'm using Intel Fortran Compiler 10.0 under Windows with 2GB RAM.

Does anybody know what the reason is, and what to do to make allocatable arrays work faster?

program slow_alloc
   implicit none
   ! Static alternative for comparison -- swap with the allocatable version:
   ! integer, parameter :: NP = 10000000
   ! real :: X(NP), Y(NP)
   integer :: NP
   real, allocatable :: X(:), Y(:)
   integer :: i, k, ist, iend, icountrate

   NP = 10000000
   allocate(X(NP), Y(NP))

   call system_clock(ist, icountrate)   ! start timer
   do k = 1, 100
      do i = 1, NP
         X(i) = 0.
         Y(i) = 0.
      enddo
   enddo
   call system_clock(iend)              ! stop timer
   print *, 'elapsed (s):', real(iend - ist) / real(icountrate)
end program slow_alloc

Steve_Nuchia
New Contributor I
OK, looking just at what you posted, it is clear that two of the three variations are heavily optimized, since they finish in far less time than the memory subsystem could possibly process the steps called for at the source level. (Back-of-the-envelope: 100 passes over two 10,000,000-element 4-byte REAL arrays is 100 x 2 x 40 MB = 8 GB of stores; assuming, say, ~2 GB/s of sustained write bandwidth on hardware of this era, that is at least about 4 seconds, so anything finishing in a small fraction of that cannot actually be executing every store.)
That leaves the interesting question: why isn't the third variation also optimized? It is taking the "right" amount of time, so it isn't doing anything really stupid; it's just missing an optimization opportunity that the compiler finds in the other two cases.
I don't know. Sounds like a deficiency in the optimizer but it could be something else.

abhimodak
New Contributor I

Thanks. I feel sort of relieved that I wasn't doing something really silly with the code snippet and/or building the exe.

I am trying to see the effect of "size" as dictated by NP and the number of iterations, NLoops (it is hard-wired to 100 in my sample program posted here). What I consistently see is that the whole-array assignment without explicit bounds (the third variation) takes the longest time. This is true even for NP = 10,000 and 100,000 loop iterations.

The increase in computation time is proportional to both NP and NLoops. The difference is really staggering as either of these increases.

When not using allocatable arrays, all three variations take exactly the same time. On the other hand, if either NP or NLoops is "read" from input, the increase in computation time starts to appear at smaller values of these variables than when they are set in the program.

I am now genuinely concerned about whether to use the "array language" or not. At present, it looks like one will have to use it only with explicit dimensions present.

Abhi

jimdempseyatthecove
Honored Contributor III

Abhi,

There may be another factor at play here.

When you configure the program to run with statically allocated arrays, the arrays are preallocated and initialized, and when the program is launched the initialized arrays are loaded into memory. By the time the first statement of your program executes, the virtual memory of your static arrays has already been touched. Some of this virtual memory is in cache, but more importantly, all of it has been committed.

When you configure the program to run with allocations, the ALLOCATE statement by itself may result in the virtual address space being assigned to your application, but with only the header portion of the allocation committed (due to it being formerly linked into the heap). Then, as you fill in the array the first time, the application will page-fault as it crosses page boundaries until the first pass is finished. After the first pass the virtual memory is in the committed state and subsequent reuse proceeds without page faults.

A proper test will perform the allocation, then perform an initial wipe or touch in page-length increments, and only then read the timer counter prior to entering the performance test loop.
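
A sketch of what I mean (my illustration only; it assumes 4 KiB pages and SYSTEM_CLOCK timing, so adjust for your system):

program touch_then_time
   implicit none
   integer, parameter :: NP = 10000000
   integer, parameter :: REALS_PER_PAGE = 4096 / 4   ! assumed 4 KiB pages
   real, allocatable :: X(:)
   integer :: i, ist, iend, irate

   allocate(X(NP))

   ! First touch: write one element per page so every page is committed
   ! before timing begins.
   do i = 1, NP, REALS_PER_PAGE
      X(i) = 0.
   enddo
   X(NP) = 0.      ! make sure the last page is touched too

   ! Only now read the timer; the timed loop incurs no page faults.
   call system_clock(ist, irate)
   do i = 1, NP
      X(i) = 0.
   enddo
   call system_clock(iend)
   print *, 'timed pass (s):', real(iend - ist) / real(irate)
end program touch_then_time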

Jim Dempsey

jimdempseyatthecove
Honored Contributor III


Also, I thought I might add this for future reference as your data file may grow.

64-bit Vista is particularly bad in this area, and XP Pro x64 is less so but to some extent suffers from a similar problem. In my opinion it is a design flaw in the operating system.

My system has 4GB of physical RAM. The memory footprint of my workhorse application plus data is a few hundred megabytes; it easily loads and executes on a 1GB laptop. When the program runs it generates a massive log file. Depending on the run this can be tens of GB (~40GB was my largest). On the 4GB system, as the log file approaches 4GB, performance drops to about 1% while the system undergoes a massive flurry of paging operations. I have a control panel button that will pause the program, and when the program exhibits this slowdown, if I pause the application the disk thrashing continues for several minutes. When the system is quiescent again, resuming the application gets it running at 100% again. It is somewhat annoying to have to babysit the program as it runs overnight.

In my opinion, MS needs to learn a thing or two about writing virtual memory systems. Now, there may be a registry setting to change this behavior, but unfortunately, ever since about 1997 the search program of their knowledge base has been dumbed down to the point where it is almost useless for expressing a proper search expression and finding what you want.

Jim Dempsey

abhimodak
New Contributor I

Hi Jim

Many thanks for your detailed posts.

However, I feel like I understand about 50% of what you are saying. The reason I remain unconvinced is that I don't quite grasp why writing A(1:N) = something versus A = something should make a difference. And it is NOT a small difference, based on what I am seeing.

I think I know that these two are not "exactly the same", since A can be A(lowerbound:upperbound) where the bounds are not 1 and N, respectively. Thus, something different may need to happen internally at the machine level when using the former. (Pardon my simplistic approach.)
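
A toy example of what I mean about the bounds (the values here are hypothetical):

program bounds_sketch
   implicit none
   integer, parameter :: N = 10
   real :: A(-5:4)    ! ten elements, but the lower bound is -5, not 1

   A = 0.             ! whole-array form: assigns A(-5)..A(4)
   ! A(1:N) = 0.      ! section form: would reference A(1)..A(10),
                      ! but A(5:10) do not exist here!
   print *, lbound(A), ubound(A), size(A)   ! prints -5  4  10
end program bounds_sketch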

Some of the F90/95 books make a special point of the Array Language (I believe one example would be the Numerical Recipes book). It is then more than disturbing to learn that such performance penalties exist.

There may be no escape from it but, just to reiterate my point, I somehow (read: adamantly!) refuse to believe that writing with or without an explicit section should make a difference.

I am going to see if I can run this program with executables created by other compilers.

Sincerely

Abhi

abhimodak
New Contributor I

Ok, I did tests with CVF 6.6c and Absoft 8.2.

In the respective release modes, there is virtually no difference in computation time among the three variations.

Abhi

Ron_Green
Moderator

Whatever CVF and Absoft do is irrelevant, especially when absolute performance is not specified.


What I found is that the example is flawed. The K=1,100 loops are subject to removal; in fact, the -opt-report 3 option will show that the loops are interchanged, and guess what: when K is the inner loop, the initializations can be hoisted into the outer I= loop, and the inner loop is then empty and removed. I discovered it does this for the first two loops but not for the final one, hence handicapping it. The reason the array syntax does not get its K= loop removed is that the array syntax translates to a call to the intel_fast_memset() function, whereas the first two loops translate directly to vectorized loops using SSE. The intel_fast_memset() call may prevent removal of the K= loop, since the compiler may make the safe assumption that the function call could have side effects.
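
To make that concrete, here is a sketch of the transformation the opt-report describes (my paraphrase in source form, not actual compiler output):

program loop_removal_sketch
   implicit none
   integer, parameter :: NP = 1000
   real :: X(NP)
   integer :: i, k

   ! 1. As written in the source:
   do k = 1, 100
      do i = 1, NP
         X(i) = 0.
      enddo
   enddo

   ! 2. After interchange, the store is invariant in the inner K= loop:
   do i = 1, NP
      do k = 1, 100
         X(i) = 0.
      enddo
   enddo

   ! 3. After hoisting the invariant store, the empty K= loop disappears:
   do i = 1, NP
      X(i) = 0.
   enddo

   print *, X(1)   ! keep X live so the loops are not removed altogether
end program loop_removal_sketch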

Anyhow, let's remove the noise (i.e., the K=1,100 silliness). Here is a better example:

Program Test_AllocationSpeed
   !
   ! Purpose: Test speed difference when using allocatable arrays.
   !
   Implicit None
   !
   Integer :: NP
   Real(8), Allocatable :: X(:), Y(:)
   !
   ! Integer, Parameter :: NP = 10000000
   ! Real(8) :: X(NP), Y(NP)
   !
   Integer :: ial, i, k
   Character(32) :: AllocationError

   Real(8) :: ts, te
   NP = 300000000

   Allocate(X(NP), Y(NP), stat=ial) !, ERRMSG = AllocationError)
   if (ial /= 0) then
      Stop
      !Write(*,"(A)") Trim(AllocationError)
   endif

   !...warm up the X and Y vectors (first touch)
   X = 0.0_8
   Y = 0.0_8

   !..With loop
   Call CPU_Time(ts)
   do i = 1, NP
      X(i) = 0.0d0
      Y(i) = 0.0d0
   enddo
   Call CPU_Time(te)
   Write(*,"(A)") "With Loop:"
   Write(*,"(A,ES14.6)") "Computation time with Loop :", (te-ts)
   Write(*,*)

   ! With whole array dimensioned
   Call CPU_Time(ts)
   X(1:NP) = 0.0d0
   Y(1:NP) = 0.0d0
   Call CPU_Time(te)
   Write(*,"(A)") "With whole array dimensioned:"
   Write(*,"(A,ES14.6)") "Computation time :", (te-ts)
   Write(*,*)

   ! With whole array NOT dimensioned
   Call CPU_Time(ts)
   X = 0.0d0
   Y = 0.0d0
   Call CPU_Time(te)
   Write(*,"(A)") "With whole array NOT dimensioned:"
   Write(*,"(A,ES14.6)") "Computation time :", (te-ts)
   Write(*,*)

End Program Test_AllocationSpeed

Now granted, I'm a Linux/Mac guy, so perhaps there is some Windoze x64 anomaly going on. Note that I first touch the X and Y vectors to get past initial paging effects. For the example above, this is what I get on x64 Linux (and similar on Mac OS X):

$ ./test4
With Loop:
Computation time with Loop : 1.830000E+00

With whole array dimensioned:
Computation time : 1.830000E+00

With whole array NOT dimensioned:
Computation time : 1.140000E+00

I am experimenting on Windoze x64 and am seeing some differences. I had to reduce the array size to fit my memory footprint. It does seem that on Win x64 the last loop, the array syntax that translates to the intel_fast_memset() call, is somewhere between 40% and 100% slower than the vectorized versions. Again, since we're only seeing this on Win x64 (not Linux, Win x32, nor Mac OS X, which I tested too), I would have to assume this is isolated to Windows x64. Maybe a bug in intel_fast_memset() specific to Win x64, but I'm not finished with my analysis.

cheers

ron
