Data Prefetching using Fortran Directives

Sina_M_ · 06-11-2014

Hi everyone,

I am working on optimizing sparse algorithms with Intel's Fortran compiler. After applying various optimization features, I want to make good use of data prefetching and the cache. To that end I tested several configurations of prefetching directives and intrinsic functions on both an Intel Core i7 and an AMD APU processor, but I don't get the results I expected. In one specific case, however, I think I do get real prefetching, which gives me a 3-4 times speedup.

Here is the faster code:

[fortran]
DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X, TEMP
DOUBLE PRECISION :: SUM
INTEGER :: SIZE, I, J, COUNT, BLS, I0

SIZE = 1000000
BLS = 21 * 25

ALLOCATE(A2D(0:BLS * SIZE - 1))
ALLOCATE(X(0:SIZE - 1))
ALLOCATE(TEMP(0:BLS - 1))

DO COUNT = 0, 50
!$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
!$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, TEMP, I0)
!DEC$ SIMD
   DO J = 0, SIZE - 1
      I0 = BLS * J
      ! Copy block J of A2D into the thread-private buffer TEMP
      DO I = 0, BLS - 1
         TEMP(I) = A2D(I0 + I)
      END DO
      ! Sum over the buffered copy
      SUM = 0.D0
      DO I = 0, BLS - 1
         SUM = SUM + TEMP(I) * 2.D0
      END DO
      X(J) = SUM
   END DO
!$OMP END DO
!$OMP END PARALLEL
END DO
[/fortran]

And here is the code that I would expect to be the right approach, but it is around 4 times slower (I think because the prefetch directive has no effect):

[fortran]
DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X
DOUBLE PRECISION :: SUM
INTEGER :: SIZE, I, J, COUNT, BLS, I0

SIZE = 1000000
BLS = 21 * 25

ALLOCATE(A2D(0:BLS * SIZE - 1))
ALLOCATE(X(0:SIZE - 1))

DO COUNT = 0, 50
!$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
!$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, I0)
!DEC$ PREFETCH A2D
   DO J = 0, SIZE - 1
      I0 = BLS * J
      SUM = 0.D0
      ! Sum block J directly from A2D, without the intermediate buffer
!DEC$ SIMD
      DO I = 0, BLS - 1
         SUM = SUM + A2D(I0 + I) * 2.D0
      END DO
      X(J) = SUM
   END DO
!$OMP END DO
!$OMP END PARALLEL
END DO
[/fortran]

I am really confused and need your help.