Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

## nested loop vectorization in OpenMP taskloop

Beginner
1,403 Views

Hi everybody,

I have a simple program with four nested loops. The outer loop is parallelized with the OpenMP taskloop directive, and I am trying to vectorize the innermost loop.

```fortran
program main

  use modf
  use omp_lib

  implicit none

  integer :: n,i,j,k
  integer :: d1,d2,d3,d4
  double precision :: corr
  double precision :: time

  d1 = 100
  d2 = 100
  d3 = 100
  d4 = 40
  corr = 2.3

  !$omp parallel
  !$omp master

  allocate(matrix(d1,d2,d3,d4))
  allocate(matout(d1,d2,d3,d4))

  matrix(:,:,:,:) = 0.0

  time = omp_get_wtime()
  DO n=1,d4

    DO i=1,d3
      DO j=1,d2

        !$omp simd aligned(matrix,matout:64)
        DO k=1,d1
          matout(k,j,i,n) = arraycomp(matrix(k,j,i,n),corr)
        ENDDO
        !$omp end simd

      ENDDO
    ENDDO

  ENDDO

  time = omp_get_wtime() - time

  !$omp end master
  !$omp end parallel

  print*,time

end program main
```

where the function arraycomp is contained in the module modf:

```fortran
module modf

  double precision,allocatable,dimension(:,:,:,:) :: matrix
  double precision,allocatable,dimension(:,:,:,:) :: matout

contains

  function arraycomp(in1,in2) result(output)
    !$omp declare simd(arraycomp)
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
  end function arraycomp

end module
```

The code is compiled with this Makefile (ifort 17.0.1):

```makefile
test.xx : *.f90
	ifort -O3 -g -xAVX -qopenmp -qopt-report5 -align array64byte $^ -o $@
```

My problem is that the compiler fails to vectorize the innermost loop, and the optimization report contains these remarks:

```
LOOP BEGIN at main.f90(28,7)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at main.f90(31,5)
      remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification

      LOOP BEGIN at main.f90(32,9)
         remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification

         LOOP BEGIN at main.f90(34,19)
            remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
         LOOP END
      LOOP END
   LOOP END
LOOP END
```

This is probably because the loop variable is assigned at run time inside the task.

Is there a way to avoid this behaviour and vectorize the innermost loop correctly?

Thanks for your attention.

Best regards

Eric

5 Replies
Honored Contributor III

The OpenMP standard has a lot of restrictions on what is allowed in a loop targeted by a SIMD directive. One restriction that might be relevant here is that the loop cannot contain any branches to outside the loop. I would guess that the function call is considered to be a branch.

Manually inlining the "arraycomp" function should enable vectorization.
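
As a minimal sketch of what manual inlining could look like here: the body of arraycomp is written directly into the innermost loop, so the simd loop no longer contains a call. The array sizes are reduced and the arrays are declared locally only to keep the example self-contained, the taskloop directive is left out, and the aligned clause is omitted because alignment is not established in this sketch. These choices are assumptions for illustration, not the original poster's code.

```fortran
program inline_sketch
  ! Sketch only: arraycomp's body is inlined by hand into the k loop.
  ! Sizes are smaller than in the original post to keep memory use low.
  implicit none
  integer, parameter :: d1 = 100, d2 = 10, d3 = 10, d4 = 4
  integer :: n, i, j, k
  double precision, allocatable :: matrix(:,:,:,:), matout(:,:,:,:)
  double precision :: corr

  corr = 2.3d0
  allocate(matrix(d1,d2,d3,d4), matout(d1,d2,d3,d4))
  matrix = 0.0d0

  do n = 1, d4
    do i = 1, d3
      do j = 1, d2
        ! No function call inside the simd loop any more.
        !$omp simd
        do k = 1, d1
          matout(k,j,i,n) = matrix(k,j,i,n) + abs(corr)
        end do
      end do
    end do
  end do

  print *, matout(1,1,1,1)
end program inline_sketch
```

The `!$omp simd` line only takes effect when OpenMP (or OpenMP SIMD) support is enabled at compile time; otherwise it is treated as a comment and the loop is left to the auto-vectorizer.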

Beginner

Thanks for your reply. You're right: with manual inlining the loop was vectorized:

```
LOOP BEGIN at main.f90(35,13)
remark #15388: vectorization support: reference at (36:17) has aligned access   [ main.f90(36,17) ]
remark #15389: vectorization support: reference at (36:36) has unaligned access   [ main.f90(36,36) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 4
remark #15399: vectorization support: unroll factor set to 4
remark #15309: vectorization support: normalized vectorization overhead 0.364
remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
remark #15442: entire loop may be executed in remainder
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 10
remark #15477: vector cost: 2.750
remark #15478: estimated potential speedup: 3.230
remark #15488: --- end vector cost summary ---
LOOP END
```

But I think the behaviour is due to taskloop and not to OpenMP simd, because if I use OpenMP do instead of OpenMP taskloop, the code is vectorized without any problem:

```
LOOP BEGIN at main.f90(35,13)
remark #15389: vectorization support: reference matrix_(k,j,i,n) has unaligned access   [ main.f90(36,45) ]
remark #15389: vectorization support: reference matout_(k,j,i,n) has unaligned access   [ main.f90(36,17) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15399: vectorization support: unroll factor set to 4
remark #15309: vectorization support: normalized vectorization overhead 0.052
remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
remark #15451: unmasked unaligned unit stride stores: 2
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 124
remark #15477: vector cost: 70.000
remark #15478: estimated potential speedup: 1.750
remark #15484: vector function calls: 1
remark #15488: --- end vector cost summary ---
remark #15489: --- begin vector function matching report ---
remark #15490: Function call: ARRAYCOMP with simdlen=2, actual parameter types: (vector,uniform)   [ main.f90(36,35) ]
remark #15492: A suitable vector variant was found (out of 2) with xmm, simdlen=2, unmasked, formal parameter types: (vector,vector)
remark #15493: --- end vector function matching report ---
LOOP END
```

Naturally I prefer the first approach because the speedup is higher!

Thanks again

Eric

Honored Contributor III

In your modf, you have not attributed the arrays as being aligned. Therefore the allocation is not required to be aligned.

```fortran
!dir$ attributes align: 64:: matrix
double precision,allocatable,dimension(:,:,:,:) :: matrix
!dir$ attributes align: 64:: matout
double precision,allocatable,dimension(:,:,:,:) :: matout
```

Then specify the function as a vector function:

```fortran
function arraycomp(in1,in2) result(output)
  !dir$ attributes vector :: arraycomp
  double precision, intent(inout) :: in1,in2
  double precision:: output
  output = (in1 + abs(in2))
end function arraycomp
```

Or target a specific processor architecture:

```fortran
function arraycomp(in1,in2) result(output)
  !dir$ attributes vector : processor(core_4th_gen_avx) :: arraycomp
  !... !dir$ attributes vector : processor(mic_avx512 ) :: arraycomp

  double precision, intent(inout) :: in1,in2
  double precision:: output
  output = (in1 + abs(in2))
end function arraycomp
```

Then remove the !$omp simd / !$omp end simd directives.

Note, your inner loop k (the one able to vectorize) is .NOT. an OpenMP sliceable DO loop index. Ergo, !$omp simd of this loop index variable is nonsensical.

Jim Dempsey

Beginner

I edited modf.f90 following your suggestion and also tried adding the directive to force inlining:

```fortran
module modf

  !dir$ attributes align: 64:: matrix
  double precision,allocatable,dimension(:,:,:,:) :: matrix
  !dir$ attributes align: 64:: matout
  double precision,allocatable,dimension(:,:,:,:) :: matout

contains

  !DEC$ ATTRIBUTES FORCEINLINE :: arraycomp
  function arraycomp(in1,in2) result(output)
    !dir$ attributes vector :: arraycomp
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
  end function arraycomp

end module
```

but the loop in main.f90 is still not vectorized. The only ways that seem to work are manual inlining or putting the function into the main program with contains:

```fortran
program main

  use modf
  use omp_lib

  implicit none

  integer :: n,i,j,k
  integer :: d1,d2,d3,d4
  double precision :: corr
  double precision :: time

  d1 = 100
  d2 = 100
  d3 = 100
  d4 = 40
  corr = 2.3

  !$omp parallel
  !$omp master

  allocate(matrix(d1,d2,d3,d4))
  allocate(matout(d1,d2,d3,d4))

  matrix(:,:,:,:) = 0.0

  time = omp_get_wtime()
  DO n=1,d4

    DO i=1,d3
      DO j=1,d2

        DO k=1,d1
          matout(k,j,i,n) = arraycomp(matrix(k,j,i,n),corr)
        ENDDO

      ENDDO
    ENDDO

  ENDDO

  time = omp_get_wtime() - time

  !$omp end master
  !$omp end parallel

  print*,time !,matout(5,5,5,5)

contains

  function arraycomp(in1,in2) result(output)
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
  end function arraycomp

end program main
```

Reading these two articles on the Intel site:

https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization

https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization

it seems the only way to align an array that is declared in one module and allocated in another is to use the compiler flag -align array64byte.

From the examples, I understand that the directive !dir$ vector aligned can tell the compiler that a loop working on module arrays may be vectorized with aligned accesses. But it is incompatible with taskloop: ifort returns an error, because I must use taskloop inside a master section:

```
main.f90(35): error #7631: This statement or directive is not permitted within the body of an OpenMP* MASTER/END MASTER block.
!dir$ vector aligned
------------------^
compilation aborted for main.f90 (code 1)
```

Thanks

Eric

Honored Contributor III

Note, when the arrays matrix and matout are allocated aligned, this only assures the compiler that the lowest cell of the entire array is aligned. IOW, an element is assured to be aligned only when all of its indexes are at their lbound. Thus for any slicing up of the array (parallel constructs), the compiler cannot know that the starting point is aligned.
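
To make the point concrete with the sizes from the original post: with d1 = 100 and 8-byte double precision elements, each column along k spans 800 bytes, and 800 mod 64 = 32. So even if the array base sits on a 64-byte boundary, every second column starts 32 bytes off one. The small program below is only an arithmetic illustration (the names bytes_per_elem and alignment are made up for this sketch); it prints the byte offset of matrix(1,j,1,1) from the array base for the first few values of j.

```fortran
program column_alignment
  ! Illustration only: offset of matrix(1,j,1,1) from the array base,
  ! assuming 8-byte elements and d1 = 100 as in the original post.
  implicit none
  integer, parameter :: d1 = 100, bytes_per_elem = 8, alignment = 64
  integer :: j, offset

  do j = 1, 4
    offset = (j - 1) * d1 * bytes_per_elem
    ! A remainder of 0 means the column start is 64-byte aligned
    ! (given an aligned base); anything else means it is not.
    print '(a,i0,a,i0,a,i0)', 'j=', j, ' offset=', offset, &
          ' mod 64=', mod(offset, alignment)
  end do
end program column_alignment
```

If the leading dimension were a multiple of 8 elements (64 bytes), every column start would share the alignment of the base, but the compiler would still need to be told or be able to prove that.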

If you pass a multi-dimensioned array into a parallel region, you might be able to get the loop to vectorize if you can successfully get the collapse to work: