nested loop vectorization in OpenMP taskloop

eric_p_ · 12-16-2016

Hi everybody,

I have a simple program with four nested loops. The outermost loop is parallelized with the OpenMP taskloop directive, and I am trying to vectorize the innermost loop.

program main

use modf
use omp_lib

implicit none

integer :: n,i,j,k
integer :: d1,d2,d3,d4
double precision :: corr
double precision :: time

d1 = 100
d2 = 100
d3 = 100
d4 = 40
corr = 2.3

!$omp parallel
!$omp master

allocate(matrix(d1,d2,d3,d4))
allocate(matout(d1,d2,d3,d4))

matrix(:,:,:,:) = 0.0

time = omp_get_wtime()
!$omp taskloop default(none) firstprivate(d1,d2,d3,d4,corr) shared(matrix,matout)
DO n=1,d4

    DO i=1,d3
        DO j=1,d2

            !$omp simd aligned(matrix,matout:64)
            DO k=1,d1
                matout(k,j,i,n) = arraycomp(matrix(k,j,i,n),corr)
            ENDDO
            !$omp end simd

        ENDDO
    ENDDO

ENDDO
!$omp end taskloop

time = omp_get_wtime() - time

!$omp end master
!$omp end parallel


print*,time 

end program main

where the arraycomp function is contained in this module:

module modf

double precision,allocatable,dimension(:,:,:,:) :: matrix
double precision,allocatable,dimension(:,:,:,:) :: matout

contains

function arraycomp(in1,in2) result(output)
    !$omp declare simd(arraycomp)
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp

end module

and the code is compiled with this Makefile (ifort 17.0.1):

test.xx : *.f90
		ifort -O3 -g -xAVX -qopenmp -qopt-report5 -align array64byte $^ -o $@

My problem is that the compiler fails to vectorize the innermost loop, and the optimization report shows these remarks:

LOOP BEGIN at main.f90(28,7)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at main.f90(31,5)
      remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification

      LOOP BEGIN at main.f90(32,9)
         remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification

         LOOP BEGIN at main.f90(34,19)
            remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
         LOOP END
      LOOP END
   LOOP END
LOOP END

This is probably due to the loop variable being assigned at run time inside the task.

Is there a way to avoid this behaviour and vectorize the innermost loop correctly?

Thanks for your attention.

Best regards

Eric

McCalpinJohn · 12-16-2016

The OpenMP standard has a lot of restrictions on what is allowed in a loop targeted by a SIMD pragma. One restriction that might be relevant here is that the loop cannot contain any branches to outside the loop. I would guess that the function call is considered to be a branch.

Manually inlining the "arraycomp" function should enable vectorization.
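
A minimal sketch of what manual inlining could look like here, assuming the body of arraycomp is simply written out in place of the call. The program name inline_sketch and the use of local allocatable arrays (instead of the module arrays) are illustrative choices, not taken from the original code.

```fortran
program inline_sketch
    implicit none

    integer :: n, i, j, k
    integer :: d1, d2, d3, d4
    double precision :: corr
    ! Local arrays stand in for the module arrays of the original program
    double precision, allocatable :: matrix(:,:,:,:), matout(:,:,:,:)

    d1 = 100
    d2 = 100
    d3 = 100
    d4 = 40
    corr = 2.3d0

    allocate(matrix(d1,d2,d3,d4))
    allocate(matout(d1,d2,d3,d4))
    matrix = 0.0d0

    !$omp parallel
    !$omp master
    !$omp taskloop default(none) firstprivate(d1,d2,d3,d4,corr) shared(matrix,matout)
    do n = 1, d4
        do i = 1, d3
            do j = 1, d2
                !$omp simd
                do k = 1, d1
                    ! Body of arraycomp written out by hand: no call inside the simd loop
                    matout(k,j,i,n) = matrix(k,j,i,n) + abs(corr)
                end do
            end do
        end do
    end do
    !$omp end taskloop
    !$omp end master
    !$omp end parallel

    ! Print one element so the computation is not optimized away
    print *, matout(1,1,1,1)

end program inline_sketch
```

Whether the compiler then vectorizes the loop still has to be checked in the optimization report.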

eric_p_ · 12-16-2016

Thanks for your reply. You're right: with manual inlining the loop was vectorized:

 LOOP BEGIN at main.f90(35,13)
            remark #15388: vectorization support: reference at (36:17) has aligned access   [ main.f90(36,17) ]
            remark #15389: vectorization support: reference at (36:36) has unaligned access   [ main.f90(36,36) ]
            remark #15381: vectorization support: unaligned access used inside loop body
            remark #15305: vectorization support: vector length 4
            remark #15399: vectorization support: unroll factor set to 4
            remark #15309: vectorization support: normalized vectorization overhead 0.364
            remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
            remark #15442: entire loop may be executed in remainder
            remark #15449: unmasked aligned unit stride stores: 1 
            remark #15450: unmasked unaligned unit stride loads: 1 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 10 
            remark #15477: vector cost: 2.750 
            remark #15478: estimated potential speedup: 3.230 
            remark #15488: --- end vector cost summary ---
LOOP END

However, I think the behaviour is due to taskloop and not to OpenMP simd, because if I use omp do instead of omp taskloop, the loop with the function call is vectorized without problems:

         LOOP BEGIN at main.f90(35,13)
            remark #15389: vectorization support: reference matrix_(k,j,i,n) has unaligned access   [ main.f90(36,45) ]
            remark #15389: vectorization support: reference matout_(k,j,i,n) has unaligned access   [ main.f90(36,17) ]
            remark #15381: vectorization support: unaligned access used inside loop body
            remark #15305: vectorization support: vector length 2
            remark #15399: vectorization support: unroll factor set to 4
            remark #15309: vectorization support: normalized vectorization overhead 0.052
            remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
            remark #15451: unmasked unaligned unit stride stores: 2 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 124 
            remark #15477: vector cost: 70.000 
            remark #15478: estimated potential speedup: 1.750 
            remark #15484: vector function calls: 1 
            remark #15488: --- end vector cost summary ---
            remark #15489: --- begin vector function matching report ---
            remark #15490: Function call: ARRAYCOMP with simdlen=2, actual parameter types: (vector,uniform)   [ main.f90(36,35) ]
            remark #15492: A suitable vector variant was found (out of 2) with xmm, simdlen=2, unmasked, formal parameter types: (vector,vector)
            remark #15493: --- end vector function matching report ---
         LOOP END

Naturally I prefer the first approach because the estimated speedup is higher!

Thanks again

Eric

jimdempseyatthecove · 12-16-2016

In your module modf, you have not attributed the arrays as being aligned. Therefore the allocation is not required to be aligned.

!dir$ attributes align: 64:: matrix
double precision,allocatable,dimension(:,:,:,:) :: matrix
!dir$ attributes align: 64:: matout
double precision,allocatable,dimension(:,:,:,:) :: matout

Then specify the function as a vector function:

function arraycomp(in1,in2) result(output)
!dir$ attributes vector :: arraycomp
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp

Or target a specific processor architecture:

function arraycomp(in1,in2) result(output)
!dir$ attributes vector : processor(core_4th_gen_avx) :: arraycomp
!... !dir$ attributes vector : processor(mic_avx512 ) :: arraycomp

    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp

Then remove the !$omp simd/end simd.

Note, your inner loop k (the one able to vectorize) is .NOT. an OpenMP sliceable DO loop index. Ergo, !$omp simd of this loop index variable is nonsensical.

Jim Dempsey

eric_p_ · 12-19-2016

Hi Jim, thank you for your reply.

I edited modf.f90 following your suggestion and also tried adding the directive to force inlining:

module modf

!dir$ attributes align: 64:: matrix
double precision,allocatable,dimension(:,:,:,:) :: matrix
!dir$ attributes align: 64:: matout
double precision,allocatable,dimension(:,:,:,:) :: matout

contains

!DEC$ ATTRIBUTES FORCEINLINE :: arraycomp
function arraycomp(in1,in2) result(output)
    !dir$ attributes vector :: arraycomp
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp

end module

but the loop in main.f90 is still not vectorized. The only ways that seem to work are manual inlining or putting the function into the main program with contains:

program main

use modf
use omp_lib

implicit none

integer :: n,i,j,k
integer :: d1,d2,d3,d4
double precision :: corr
double precision :: time

d1 = 100
d2 = 100
d3 = 100
d4 = 40
corr = 2.3

!$omp parallel
!$omp master

allocate(matrix(d1,d2,d3,d4))
allocate(matout(d1,d2,d3,d4))

matrix(:,:,:,:) = 0.0

time = omp_get_wtime()
!$omp taskloop default(none) firstprivate(d1,d2,d3,d4,corr) shared(matrix,matout)
DO n=1,d4

    DO i=1,d3
        DO j=1,d2

            DO k=1,d1          
                matout(k,j,i,n) = arraycomp(matrix(k,j,i,n),corr)
            ENDDO
          

        ENDDO
    ENDDO

ENDDO
!$omp end taskloop

time = omp_get_wtime() - time

!$omp end master
!$omp end parallel


print*,time !,matout(5,5,5,5)

contains

function arraycomp(in1,in2) result(output)
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp


end program main

Reading these two articles on the Intel site:

https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization

https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization

it seems the only way to align an array that is declared in one module and allocated in another is to use the compiler flag -align array64byte.

From the examples, I understand that the directive !dir$ vector aligned can tell the compiler that a loop working on module arrays can be vectorized with aligned accesses. However, it is incompatible with taskloop: ifort returns an error, because I must use taskloop inside a master section:

main.f90(35): error #7631: This statement or directive is not permitted within the body of an OpenMP* MASTER/END MASTER
block.
            !dir$ vector aligned
------------------^
compilation aborted for main.f90 (code 1)

Thanks

Eric

jimdempseyatthecove · 12-19-2016

Note, when the arrays matrix and matout are allocated aligned, this only assures the compiler that the lowest cell of the entire array is aligned. IOW, alignment is assured only when all indexes of the array are at their lbound. Thus, for any slicing up of the array (parallel constructs), the compiler cannot know that the starting point is aligned.
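
As a worked illustration with the sizes from this thread: with d1 = 100 and 8-byte elements, each column matrix(:,j,i,n) occupies 800 bytes, and 800 = 12*64 + 32, so consecutive columns cannot all start on a 64-byte boundary even when the array base does. The short sketch below only does this arithmetic; the program name slice_alignment and the 8-byte element size are assumptions, and it does not inspect real addresses.

```fortran
program slice_alignment
    implicit none

    integer, parameter :: d1 = 100        ! extent of the first dimension, as in the thread
    integer, parameter :: elem_bytes = 8  ! assumes double precision occupies 8 bytes
    integer :: j, offset

    ! Byte offset of matrix(1,j,1,1) relative to matrix(1,1,1,1),
    ! and its remainder modulo a 64-byte boundary
    do j = 1, 4
        offset = (j - 1) * d1 * elem_bytes
        print '(a,i0,a,i0,a,i0)', 'j=', j, '  offset=', offset, '  mod 64=', mod(offset, 64)
    end do

end program slice_alignment
```

For j = 2 the remainder is 32, so that column starts in the middle of a 64-byte block; the inner k loop over such a column is not 64-byte aligned regardless of how the array was allocated.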

If you pass a multi-dimensioned array into a parallel region, you might be able to get the loop to vectorize if you can successfully get the collapse to work:

!$OMP TASKLOOP COLLAPSE(4) ...

Though I think you would have better luck using:

!$OMP PARALLEL DO COLLAPSE(4) SCHEDULE(SIMD:STATIC) ...
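
A rough sketch of how this variant might look in full, assuming a compiler with OpenMP 4.5 support. It is written as a combined parallel do simd construct with the schedule modifier spelled simd:static, since the simd modifier only has an effect when the loop is associated with a SIMD construct; that combination is an assumption about the intended form. The program name collapse_sketch, the local arrays, and the inlined loop body are illustrative.

```fortran
program collapse_sketch
    use omp_lib
    implicit none

    integer, parameter :: d1 = 100, d2 = 100, d3 = 100, d4 = 40
    integer :: n, i, j, k
    double precision :: corr, time
    ! Local arrays stand in for the module arrays of the original program
    double precision, allocatable :: matrix(:,:,:,:), matout(:,:,:,:)

    corr = 2.3d0
    allocate(matrix(d1,d2,d3,d4))
    allocate(matout(d1,d2,d3,d4))
    matrix = 0.0d0

    time = omp_get_wtime()

    ! All four loops are collapsed into one iteration space, which is
    ! divided among threads in chunks sized for SIMD execution
    !$omp parallel do simd collapse(4) schedule(simd:static)
    do n = 1, d4
        do i = 1, d3
            do j = 1, d2
                do k = 1, d1
                    matout(k,j,i,n) = matrix(k,j,i,n) + abs(corr)
                end do
            end do
        end do
    end do
    !$omp end parallel do simd

    time = omp_get_wtime() - time
    print *, time

end program collapse_sketch
```

This replaces the parallel/master/taskloop structure entirely, so it only applies if tasks are not otherwise required.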

Jim Dempsey