Re: Performance penalty for allocatable arrays?

Intel_C_Intel · 01-27-2003

Is there an inherent penalty associated with using allocatable arrays? I have two examples below. The one with allocatable arrays in a module takes 33% more CPU time. If I change ALLOCATABLE to POINTER in the array declarations, the CPU time jumps another 68%.

Example 1: Static arrays accessed through common blocks.

      program main
      parameter (na = 1000000)
      common /A_MOD/ dt,A(na),dAdt(na)
      real t(2)

      dt = 0.001
      do n=1,1000
        call SUB
      enddo
      t2 = DTIME(t)
      print *, t

      stop
      end

      subroutine SUB
      parameter (na = 1000000)
      common /A_MOD/ dt,A(na),dAdt(na)

      do n=1,na
        A(n) = A(n) + dAdt(n)*dt
      enddo

      return
      end

Example 2: Allocatable arrays accessed through module

      module A_MOD

      integer :: na
      real :: dt = 0.001
      real, allocatable, dimension(:) :: A, dAdt

      end module A_MOD

      program main
      use A_MOD
      real t(2)

      na = 1000000
      ALLOCATE(A(na))
      ALLOCATE(dAdt(na))

      do n=1,1000
        call SUB
      enddo
      t2 = DTIME(t)
      print *, t

      stop
      end

      subroutine SUB
      use A_MOD

      do n=1,na
        A(n) = A(n) + dAdt(n)*dt
      enddo

      return
      end

Steven_L_Intel1 · 01-28-2003

Apples and oranges. You're comparing compile-time known array bounds with those that have to be computed and fetched at run-time, not to mention accessed through a pointer as compared to link-time static addressing in the "static" case.
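
To make the distinction concrete, here is a minimal sketch of one common way to give the compiler explicit bounds while still allocating at run time: keep the allocatable arrays in the module, but do the arithmetic in a routine whose dummy arguments are explicit-shape. The routine name KERNEL, the small array size, and the initial values are assumptions made for illustration and are not from the original examples. Whether this recovers any of the lost time depends on the compiler, so treat it as something to measure rather than a guaranteed fix.

```fortran
      module a_mod
      implicit none
      integer :: na
      real :: dt = 0.001
      real, allocatable, dimension(:) :: a, dadt
      end module a_mod

      subroutine kernel(n, a, dadt, dt)
!     Hypothetical helper: explicit-shape dummies a(n) and dadt(n)
!     are contiguous with bounds 1..n, so no descriptor is read here.
      implicit none
      integer, intent(in) :: n
      real, intent(inout) :: a(n)
      real, intent(in) :: dadt(n), dt
      integer :: i
      do i = 1, n
        a(i) = a(i) + dadt(i)*dt
      enddo
      end subroutine kernel

      program main
      use a_mod
      implicit none
      integer :: k
!     Small size chosen only to keep the sketch quick to run.
      na = 1000
      allocate(a(na), dadt(na))
      a = 0.0
      dadt = 1.0
      do k = 1, 10
!       The allocatable arrays are passed as plain actual arguments.
        call kernel(na, a, dadt, dt)
      enddo
      print *, a(1), a(na)
      deallocate(a, dadt)
      end program main
```

The caller still owns the allocatable arrays; only the inner loop sees them as fixed-bound arrays of length n.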

Steve

Intel_C_Intel · 01-28-2003

If I want to use deferred-shape arrays, is there anything I can do (compiler settings) to keep the run time down? I have run the same example on several UNIX platforms and find the CPU time is only slightly worse (~2%) or sometimes better when the allocatable array is used.

Does the computation of array bounds and access of memory take that much time compared to the arithmetic operations?

Steven_L_Intel1 · 01-29-2003

Let me suggest you try something else first. Compile your programs with "maximum optimizations". See what times you get.

Steve

Intel_C_Intel · 01-29-2003

Increasing the optimization level from 4 to 5 gets me about 15% better CPU time when using the allocatable arrays. The improvement with the explicit arrays is 85%, though.

I modified my example a little, so the subroutine is exactly the same and the only difference in the module is that the arrays are either explicit or allocatable.

subroutine SUB
use A_MOD

do n=1,na
A(n) = A(n) + dAdt(n)*dt
enddo

return
end

I have combined a portion of the two assembly listings below, with spaces added to align them. It looks to me like everything inside the loop is the same and the difference is in the addressing, which looks like it is done once per call. Shouldn't this mean that as I increase the size of the array (and therefore the number of times through the loop), the difference in CPU time should decrease? It doesn't. Doubling the size doubles the CPU time for both cases.

	PUBLIC	_SUB@0                      	PUBLIC	_SUB@0
_SUB@0	PROC                                _SUB@0	PROC
	sub	esp, 8                      	sub	esp, 8
 ;     30       use A_MOD
 ;     31
 ;     32       do n=1,na
	mov	eax, 1000000		    	mov	eax, 1000000
					    	lea	edx, dword ptr A_MOD_mp_DADTps_$
					    	lea	ecx, dword ptr A_MOD_mp_Aps_$
	push	ebx			    	push	ebx
 ;     33         A(n) = A(n) + dAdt(n)*dt
	mov	ecx, dword ptr .data$+36
	fld	dword ptr .data$	    	fld	dword ptr .data$
	mov	edx, dword ptr .data$+8
	mov	ebx, dword ptr .data$+68
	fstp	st(1)			    	fstp	st(1)
	shl	ecx, 2
	sub	edx, ecx
	mov	ecx, dword ptr .data$+40
	add	edx, 4
	shl	ebx, 2
	sub	ecx, ebx
	lea	ecx, dword ptr 4[ecx]
	add	eax, 0
	mov	eax, eax                    	mov	eax, eax
					    	nop
lab$0044:				    lab$0040:
	fld	st(0)			    	fld	st(0)
	fmul	dword ptr [edx]             	fmul	dword ptr [edx]
	fadd	dword ptr [ecx]             	fadd	dword ptr [ecx]
	fstp	dword ptr [ecx]             	fstp	dword ptr [ecx]
	fld	st(0)                       	fld	st(0)
	fmul	dword ptr 4[edx]            	fmul	dword ptr 4[edx]
	fadd	dword ptr 4[ecx]            	fadd	dword ptr 4[ecx]
	fstp	dword ptr 4[ecx]            	fstp	dword ptr 4[ecx]
	fld	st(0)                       	fld	st(0)
	fmul	dword ptr 8[edx]            	fmul	dword ptr 8[edx]
	fadd	dword ptr 8[ecx]            	fadd	dword ptr 8[ecx]
	fstp	dword ptr 8[ecx]            	fstp	dword ptr 8[ecx]
	fld	st(0)                       	fld	st(0)
	fmul	dword ptr 12[edx]           	fmul	dword ptr 12[edx]
	fadd	dword ptr 12[ecx]           	fadd	dword ptr 12[ecx]
	fstp	dword ptr 12[ecx]           	fstp	dword ptr 12[ecx]
	fld	st(0)                       	fld	st(0)
	fmul	dword ptr 16[edx]           	fmul	dword ptr 16[edx]
	fadd	dword ptr 16[ecx]           	fadd	dword ptr 16[ecx]
	fstp	dword ptr 16[ecx]           	fstp	dword ptr 16[ecx]
	fld	st(0)                       	fld	st(0)
	fmul	dword ptr 20[edx]           	fmul	dword ptr 20[edx]
	fadd	dword ptr 20[ecx]           	fadd	dword ptr 20[ecx]
	fstp	dword ptr 20[ecx]           	fstp	dword ptr 20[ecx]
	fld	st(0)                       	fld	st(0)
	fmul	dword ptr 24[edx]           	fmul	dword ptr 24[edx]
	fadd	dword ptr 24[ecx]           	fadd	dword ptr 24[ecx]
	fstp	dword ptr 24[ecx]           	fstp	dword ptr 24[ecx]
	prefetch qword ptr 284[edx]         	prefetch qword ptr 284[edx]
	fld	st(0)                       	fld	st(0)
	fmul	dword ptr 28[edx]           	fmul	dword ptr 28[edx]
	prefetchw qword ptr 284[ecx]        	prefetchw qword ptr 284[ecx]
 ;     34       enddo
	add	edx, 32		            	add	edx, 32
	fadd	dword ptr 28[ecx]           	fadd	dword ptr 28[ecx]
	fstp	dword ptr 28[ecx]           	fstp	dword ptr 28[ecx]
	add	ecx, 32		            	add	ecx, 32
	sub	eax, 8                      	sub	eax, 8
	cmp	eax, 0                      	cmp	eax, 0
	jg	lab$0044                    	jg	lab$0040
 ;     35                                    ;     35
 ;     36       return                       ;     36       return
 ;     37       end                          ;     37       end
	ffree	st(0)		            	ffree	st(0)
	pop	ebx                         	pop	ebx
	add	esp, 8                      	add	esp, 8
	ret                                 	ret
_SUB@0	ENDP                                _SUB@0	ENDP
	END                                 	END
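
One way to test the once-per-call reasoning above is to hold the total number of element updates fixed and vary the array size. If per-call setup were the dominant extra cost, the time per element should fall as the arrays grow. The sketch below is only an illustration: the program name, the sizes, and the use of the standard CPU_TIME intrinsic in place of DTIME are my assumptions, and for the smaller sizes the timer resolution may be too coarse to give a meaningful number.

```fortran
      program scaling
      implicit none
      real, allocatable :: a(:), dadt(:)
      real :: dt, t0, t1
      integer :: na, ncalls, k, i
      dt = 0.001
      na = 1000
      do while (na <= 1000000)
!       Keep na*ncalls constant so every size does the same work.
        ncalls = 100000000/na
        allocate(a(na), dadt(na))
        a = 0.0
        dadt = 1.0
        call cpu_time(t0)
        do k = 1, ncalls
          do i = 1, na
            a(i) = a(i) + dadt(i)*dt
          enddo
        enddo
        call cpu_time(t1)
!       Printing a(na) keeps the optimizer from dropping the loop.
        print *, na, (t1 - t0)/(real(na)*ncalls), a(na)
        deallocate(a, dadt)
        na = na*10
      enddo
      end program scaling
```

The second column is CPU seconds per element update. A value that stays flat or rises with the array size would point away from per-call addressing overhead as the cause.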

Steven_L_Intel1 · 01-29-2003

Doubling the array size will double the number of memory accesses, which may be a significant factor.
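
A rough back-of-the-envelope version of that point, as a sketch: it assumes a default REAL occupies 4 bytes and that nothing stays in cache between calls (the two arrays together are far larger than the caches on typical desktop processors of this period). The program and variable names are made up for the illustration.

```fortran
      program traffic
      implicit none
      integer, parameter :: na = 1000000
      integer, parameter :: ncalls = 1000
!     Assumption: default REAL is 4 bytes on this platform.
      integer, parameter :: bytes_per_real = 4
      double precision :: working_set, per_call, total
!     Two arrays, A and dAdt, are live at once.
      working_set = 2.0d0*na*bytes_per_real
!     Each pass reads A, reads dAdt, and writes A back.
      per_call = 3.0d0*na*bytes_per_real
      total = per_call*ncalls
      print *, 'working set (MB):     ', working_set/1.0d6
      print *, 'traffic per call (MB):', per_call/1.0d6
      print *, 'total traffic (GB):   ', total/1.0d9
      end program traffic
```

Under those assumptions each call to SUB streams about 12 MB through memory for only two floating-point operations per element, so the loop is likely limited by memory bandwidth, and doubling na doubles that traffic in both versions.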

Steve