"heap-arrays" option in intel 64 mode

qlsn5 · ‎06-02-2014

Hi~,

I have a question about the ifort option "heap-arrays" in Intel 64 mode (64-bit).

I compiled a computationally heavy program in IA-32 mode (32-bit) without the "heap-arrays" option, and the computation takes about 3 seconds.

In Intel 64 mode (64-bit), I compiled the same program with the "heap-arrays" option, and in this case the computation takes about 100 seconds.

Could anyone explain the reason for this, and how I can get the IA-32 performance in Intel 64 mode?

Steven_L_Intel1 · ‎06-02-2014

I think we'd need to see an example.

qlsn5 · ‎06-02-2014

I just uploaded my program.

qlsn5 · ‎06-02-2014

The one file I missed is attached.

Steven_L_Intel1 · ‎06-02-2014

Thanks for that.

First of all, I can see the difference in 32-bit. The problem appears to be inside the memory allocator - the pattern of allocations is causing it to spend a lot of time working with its free lists. The bulk of the time is taken up at the entry to NESTED_DPOL_2D where the automatic arrays B and C are declared.

This is the first program I have seen where /heap-arrays makes such a big difference. We'll investigate this some more. Did you have a need to use /heap-arrays? You could turn it on for some sources and not all if need be.

Steven_L_Intel1 · ‎06-02-2014

It's just malloc and free taking all that time - I was distracted by the additional debug library stuff that malloc/free does. The routines taking most of the time are small and don't do much work, so the allocate/free swamps the actual work. NESTEDMUL_DPOL is another one.

Steven_L_Intel1 · ‎06-02-2014

Oh, and I saw about an 8X change from 3 seconds to 24. I could never get it to 100 seconds. Be sure you're not building with debug libraries, which makes it worse.
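The pattern described above can be sketched as a small routine with automatic arrays that is called a very large number of times. This is only an illustration under assumptions: SMALL_WORK, its arguments, and the sizes are hypothetical stand-ins, not code from the uploaded program. Building it with and without /heap-arrays should show whether allocation on entry and exit dominates, though an optimizer that inlines the routine may hide the effect.

```fortran
! Sketch only: small_work and the sizes below are hypothetical stand-ins
! for routines like NESTED_DPOL_2D; they are not from the uploaded program.
program automatic_array_cost
  implicit none
  integer, parameter :: nn = 8, ncalls = 2000000
  real(8) :: x(nn), total
  real :: t0, t1
  integer :: i

  x = 1.0_8
  total = 0.0_8

  call cpu_time(t0)
  do i = 1, ncalls
    call small_work(nn, x, total)
  end do
  call cpu_time(t1)

  ! Use the result so the optimizer cannot discard the timed loop.
  print *, 'total =', total, ' seconds =', t1 - t0

contains

  subroutine small_work(n, x, total)
    integer, intent(in) :: n
    real(8), intent(in) :: x(n)
    real(8), intent(inout) :: total
    ! b and c are automatic arrays: their size depends on the dummy n.
    ! By default they live on the stack; with /heap-arrays they are
    ! allocated on entry to the routine and freed on exit, every call.
    real(8) :: b(n), c(n)

    b = 2.0_8 * x
    c = b + x
    total = total + sum(c)
  end subroutine small_work

end program automatic_array_cost
```

Because the routine does so little arithmetic, any per-call allocation cost is large relative to the useful work, which is the situation described for NESTED_DPOL_2D and NESTEDMUL_DPOL.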

qlsn5 · ‎06-02-2014

Steve Lionel (Intel) wrote:

It's just malloc and free taking all that time - I was distracted by the additional debug library stuff that malloc/free does. The routines taking most of the time are small and don't do much work, so the allocate/free swamps the actual work. NESTEDMUL_DPOL is another one.

Dear Steve,

Thank you very much for your explanations.

Do computation and allocation take much more time for variables in heap memory than for variables in stack memory?

Actually, the main program that uses the routines I listed above needs lots of memory, so it needs to be compiled with "heap-arrays".

Another question:

In which memory region are assumed-shape arrays allocated: stack or heap?

qlsn5 · ‎06-03-2014

Additionally, I found a difference in computational speed when variables are declared differently.

Let me show examples:

------------------------------------------------------------------------------------------------

Case 1

PROGRAM TEST
  IMPLICIT NONE

  real(4) :: time_begin, time_end

  integer(4), parameter :: nn = 1000
  real(8) :: aa(nn,nn), bb(nn,nn), cc(nn,nn)

  aa = 1.0_8
  bb = 1.0_8

  call cpu_time(time_begin)
  call foo(nn,aa,bb,cc)
  call cpu_time(time_end)

  print *, time_end - time_begin

contains
  subroutine foo(n, a, b, c)
    integer(4), intent(in) :: n
    real(8), intent(in) :: a(:,:), b(:,:)
    real(8), intent(out) :: c(:,:)

    integer(4) :: i, j, k

    do i = 1, n
      do j = 1, n
        c(i,j) = 0.0_8
        do k = 1, n
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
      end do
    end do
  end subroutine foo
end program

==========================================

Case 2

PROGRAM TEST
  IMPLICIT NONE

  real(4) :: time_begin, time_end

  integer(4), parameter :: nn = 1000
  real(8) :: aa(nn,nn), bb(nn,nn), cc(nn,nn)

  aa = 1.0_8
  bb = 1.0_8

  call cpu_time(time_begin)
  call foo(nn,aa,bb,cc)
  call cpu_time(time_end)

  print *, time_end - time_begin

contains
  subroutine foo(n, a, b, c)
    integer(4), intent(in) :: n
    real(8), intent(in) :: a(n,n), b(n,n)
    real(8), intent(out) :: c(n,n)

    integer(4) :: i, j, k

    do i = 1, n
      do j = 1, n
        c(i,j) = 0.0_8
        do k = 1, n
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
      end do
    end do
  end subroutine foo
end program

------------------------------------------------------------

Both cases were compiled for IA-32 without the 'heap-arrays' option.

The second case is much faster.

Does this mean it is better to declare arrays as automatic arrays than as assumed-shape arrays?

Steven_L_Intel1 · ‎06-03-2014

Assumed-shape arrays don't imply any particular allocation; as dummy arguments they simply use whatever storage the actual argument has. Arrays declared with (:) bounds that also have the ALLOCATABLE attribute (strictly, deferred-shape arrays) are always heap allocated. If POINTER, they're heap-allocated if ALLOCATE is used, otherwise they're whatever the target was when pointer assignment was done.

The computation aspect when using /heap-arrays isn't the issue - there is no difference. But there is a cost to heap allocation and deallocation, whereas stack allocation is a single subtract instruction.
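To make the distinctions above concrete, here is a minimal sketch of the three kinds of declaration side by side. The routine DEMO and all variable names are hypothetical and chosen only for illustration.

```fortran
! Sketch only: demo, work, and buf are hypothetical names.
program storage_kinds
  implicit none
  real(8) :: src(100)

  src = 1.0_8
  call demo(size(src), src)

contains

  subroutine demo(n, a)
    integer, intent(in) :: n
    ! Assumed-shape dummy: uses the caller's storage, nothing is allocated here.
    real(8), intent(in) :: a(:)
    ! Automatic array: stack by default, heap when built with /heap-arrays.
    real(8) :: work(n)
    ! Allocatable array: always heap allocated.
    real(8), allocatable :: buf(:)

    allocate(buf(n))
    work = a
    buf = 2.0_8 * work
    print *, sum(buf)
    ! buf is deallocated automatically when demo returns.
  end subroutine demo

end program storage_kinds
```

The program prints a value of 200 (100 elements, each equal to 2). Only the allocation of work changes with /heap-arrays; the arithmetic on it is the same either way.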

Your two examples in the last post are something else entirely - the allocation is done in the main program and the arrays are all dummy arguments, not automatic arrays. The only difference is where the bounds are passed. In the second example, the compiler has more information about the bounds than it does in the first, and this can improve optimization. Most tests I have seen don't show significant differences here, though. When constructing such tests, make sure that the optimizer hasn't thrown away computational work because it sees the results were never used, which is exactly what happened here. When I add a use of C after the timing, I get identical times for the two programs.
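A minimal sketch of a timing test that keeps the optimizer honest, assuming the same FOO as in Case 2 above but with a smaller, arbitrarily chosen matrix size. The checksum print and the sanity check are additions for illustration; any reference to C after the timed region serves the same purpose.

```fortran
! Sketch only: nn = 400 and the checksum/sanity check are illustrative choices.
program timing_with_use
  implicit none
  integer, parameter :: nn = 400
  real(8) :: aa(nn,nn), bb(nn,nn), cc(nn,nn)
  real :: time_begin, time_end

  aa = 1.0_8
  bb = 1.0_8

  call cpu_time(time_begin)
  call foo(nn, aa, bb, cc)
  call cpu_time(time_end)

  print *, 'seconds: ', time_end - time_begin

  ! Referencing cc after the timed region means the compiler cannot
  ! treat the multiplication in foo as dead code and remove it.
  print *, 'checksum:', sum(cc)

  ! With aa = bb = 1, every element of cc must equal nn.
  if (abs(cc(1,1) - real(nn, 8)) > 1.0e-9_8) error stop 'unexpected result'

contains

  subroutine foo(n, a, b, c)
    integer, intent(in) :: n
    real(8), intent(in) :: a(n,n), b(n,n)
    real(8), intent(out) :: c(n,n)
    integer :: i, j, k

    do i = 1, n
      do j = 1, n
        c(i,j) = 0.0_8
        do k = 1, n
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
      end do
    end do
  end subroutine foo

end program timing_with_use
```

Swapping the dummy declarations in foo to a(:,:), b(:,:), and c(:,:) gives the assumed-shape variant, so the two forms can be compared with the work actually performed in both.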