Hi everybody,
I have a simple program with four nested loops. The outer loop is parallelized with the OpenMP taskloop directive, and I tried to vectorize the innermost loop.
    program main
      use modf
      use omp_lib
      implicit none
      integer :: n,i,j,k
      integer :: d1,d2,d3,d4
      double precision :: corr
      double precision :: time

      d1 = 100
      d2 = 100
      d3 = 100
      d4 = 40
      corr = 2.3

      !$omp parallel
      !$omp master
      allocate(matrix(d1,d2,d3,d4))
      allocate(matout(d1,d2,d3,d4))
      matrix(:,:,:,:) = 0.0
      time = omp_get_wtime()
      !$omp taskloop default(none) firstprivate(d1,d2,d3,d4,corr) shared(matrix,matout)
      DO n=1,d4
        DO i=1,d3
          DO j=1,d2
            !$omp simd aligned(matrix,matout:64)
            DO k=1,d1
              matout(k,j,i,n) = arraycomp(matrix(k,j,i,n),corr)
            ENDDO
            !$omp end simd
          ENDDO
        ENDDO
      ENDDO
      !$omp end taskloop
      time = omp_get_wtime() - time
      !$omp end master
      !$omp end parallel
      print*,time
    end program main
where the arraycomp function is contained in a module:
    module modf
      double precision,allocatable,dimension(:,:,:,:) :: matrix
      double precision,allocatable,dimension(:,:,:,:) :: matout
    contains
      function arraycomp(in1,in2) result(output)
        !$omp declare simd(arraycomp)
        double precision, intent(inout) :: in1,in2
        double precision:: output
        output = (in1 + abs(in2))
      end function arraycomp
    end module
and the code is compiled with this Makefile (ifort 17.0.1):
    test.xx : *.f90
    	ifort -O3 -g -xAVX -qopenmp -qopt-report5 -align array64byte $^ -o $@
My problem is that the compiler does not succeed in vectorizing the innermost loop, and the optimization report file shows these remarks:
    LOOP BEGIN at main.f90(28,7)
       remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

       LOOP BEGIN at main.f90(31,5)
          remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification

          LOOP BEGIN at main.f90(32,9)
             remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification

             LOOP BEGIN at main.f90(34,19)
                remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
             LOOP END
          LOOP END
       LOOP END
    LOOP END
This kind of error is probably due to the loop variable being assigned at run time inside the task.
Is there a way to avoid this behaviour and vectorize the innermost loop correctly?
Thanks for your attention.
Best regards
Eric
- Tags:
- Parallel Computing
The OpenMP standard has a lot of restrictions on what is allowed in a loop targeted by a SIMD pragma. One restriction that might be relevant here is that the loop cannot contain any branches to outside the loop. I would guess that the function call is considered to be a branch.
Manually inlining the "arraycomp" function should enable vectorization.
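The manual inlining suggested here could look roughly like the following. This is only a sketch, not the original poster's actual file: it assumes local arrays instead of the module arrays, drops the aligned clause, and writes the body of arraycomp directly in the innermost loop. The program name inline_sketch is hypothetical.

```fortran
program inline_sketch
  implicit none
  ! Sizes taken from the thread; declared as constants for this sketch only.
  integer, parameter :: d1 = 100, d2 = 100, d3 = 100, d4 = 40
  double precision, allocatable :: matrix(:,:,:,:), matout(:,:,:,:)
  double precision :: corr
  integer :: n, i, j, k

  corr = 2.3d0
  allocate(matrix(d1,d2,d3,d4), matout(d1,d2,d3,d4))
  matrix = 0.0d0

  !$omp parallel
  !$omp master
  !$omp taskloop firstprivate(corr) shared(matrix,matout)
  do n = 1, d4
    do i = 1, d3
      do j = 1, d2
        !$omp simd
        do k = 1, d1
          ! Body of arraycomp written out in place of the function call.
          matout(k,j,i,n) = matrix(k,j,i,n) + abs(corr)
        end do
      end do
    end do
  end do
  !$omp end taskloop
  !$omp end master
  !$omp end parallel

  print *, matout(1,1,1,1)
end program inline_sketch
```

With no call left in the loop body, the compiler only has to vectorize an add and an abs, which does not depend on it matching a vector variant of the function.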
Thanks for your reply. You're right: with manual inlining the loop was vectorized:
    LOOP BEGIN at main.f90(35,13)
       remark #15388: vectorization support: reference at (36:17) has aligned access   [ main.f90(36,17) ]
       remark #15389: vectorization support: reference at (36:36) has unaligned access   [ main.f90(36,36) ]
       remark #15381: vectorization support: unaligned access used inside loop body
       remark #15305: vectorization support: vector length 4
       remark #15399: vectorization support: unroll factor set to 4
       remark #15309: vectorization support: normalized vectorization overhead 0.364
       remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
       remark #15442: entire loop may be executed in remainder
       remark #15449: unmasked aligned unit stride stores: 1
       remark #15450: unmasked unaligned unit stride loads: 1
       remark #15475: --- begin vector cost summary ---
       remark #15476: scalar cost: 10
       remark #15477: vector cost: 2.750
       remark #15478: estimated potential speedup: 3.230
       remark #15488: --- end vector cost summary ---
    LOOP END
but I think that the behaviour is due to taskloop and not to OpenMP simd, because if I use OpenMP do instead of OpenMP taskloop, the code with the function call is vectorized without problems:
    LOOP BEGIN at main.f90(35,13)
       remark #15389: vectorization support: reference matrix_(k,j,i,n) has unaligned access   [ main.f90(36,45) ]
       remark #15389: vectorization support: reference matout_(k,j,i,n) has unaligned access   [ main.f90(36,17) ]
       remark #15381: vectorization support: unaligned access used inside loop body
       remark #15305: vectorization support: vector length 2
       remark #15399: vectorization support: unroll factor set to 4
       remark #15309: vectorization support: normalized vectorization overhead 0.052
       remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
       remark #15451: unmasked unaligned unit stride stores: 2
       remark #15475: --- begin vector cost summary ---
       remark #15476: scalar cost: 124
       remark #15477: vector cost: 70.000
       remark #15478: estimated potential speedup: 1.750
       remark #15484: vector function calls: 1
       remark #15488: --- end vector cost summary ---
       remark #15489: --- begin vector function matching report ---
       remark #15490: Function call: ARRAYCOMP with simdlen=2, actual parameter types: (vector,uniform)   [ main.f90(36,35) ]
       remark #15492: A suitable vector variant was found (out of 2) with xmm, simdlen=2, unmasked, formal parameter types: (vector,vector)
       remark #15493: --- end vector function matching report ---
    LOOP END
Naturally I prefer the first approach because the speedup is higher!
Thanks again
Eric
In your modf, you have not attributed the arrays as being aligned. Therefore the allocation is not required to be aligned.
    !dir$ attributes align: 64:: matrix
    double precision,allocatable,dimension(:,:,:,:) :: matrix
    !dir$ attributes align: 64:: matout
    double precision,allocatable,dimension(:,:,:,:) :: matout
Then specify the function as a vector function
    function arraycomp(in1,in2) result(output)
      !dir$ attributes vector :: arraycomp
      double precision, intent(inout) :: in1,in2
      double precision:: output
      output = (in1 + abs(in2))
    end function arraycomp
Or target a specific processor architecture
    function arraycomp(in1,in2) result(output)
      !dir$ attributes vector : processor(core_4th_gen_avx) :: arraycomp
      !...
      !dir$ attributes vector : processor(mic_avx512 ) :: arraycomp
      double precision, intent(inout) :: in1,in2
      double precision:: output
      output = (in1 + abs(in2))
    end function arraycomp
Then remove the !$omp simd/end simd
Note, your inner loop k (the one able to vectorize) is .NOT. an OpenMP sliceable DO loop index. Ergo, !$omp simd of this loop index variable is nonsensical.
Jim Dempsey
Hi Jim, thank you for your reply.
I edited modf.f90 following your suggestion and also tried adding the directive to force inlining:
    module modf
      !dir$ attributes align: 64:: matrix
      double precision,allocatable,dimension(:,:,:,:) :: matrix
      !dir$ attributes align: 64:: matout
      double precision,allocatable,dimension(:,:,:,:) :: matout
    contains
      !DEC$ ATTRIBUTES FORCEINLINE :: arraycomp
      function arraycomp(in1,in2) result(output)
        !dir$ attributes vector :: arraycomp
        double precision, intent(inout) :: in1,in2
        double precision:: output
        output = (in1 + abs(in2))
      end function arraycomp
    end module
but the loop in main.f90 is still not vectorized. The only ways that seem to work are manual inlining or putting the function into the main program with contains:
    program main
      use modf
      use omp_lib
      implicit none
      integer :: n,i,j,k
      integer :: d1,d2,d3,d4
      double precision :: corr
      double precision :: time

      d1 = 100
      d2 = 100
      d3 = 100
      d4 = 40
      corr = 2.3

      !$omp parallel
      !$omp master
      allocate(matrix(d1,d2,d3,d4))
      allocate(matout(d1,d2,d3,d4))
      matrix(:,:,:,:) = 0.0
      time = omp_get_wtime()
      !$omp taskloop default(none) firstprivate(d1,d2,d3,d4,corr) shared(matrix,matout)
      DO n=1,d4
        DO i=1,d3
          DO j=1,d2
            DO k=1,d1
              matout(k,j,i,n) = arraycomp(matrix(k,j,i,n),corr)
            ENDDO
          ENDDO
        ENDDO
      ENDDO
      !$omp end taskloop
      time = omp_get_wtime() - time
      !$omp end master
      !$omp end parallel
      print*,time !,matout(5,5,5,5)
    contains
      function arraycomp(in1,in2) result(output)
        double precision, intent(inout) :: in1,in2
        double precision:: output
        output = (in1 + abs(in2))
      end function arraycomp
    end program main
Reading these two articles on the Intel site:
https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization
https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
it seems the only way to align an array that is declared in one module and allocated in another is to use the compiler flag -align array64byte.
From the examples, I understand that the directive !dir$ vector aligned can tell the compiler that a loop working on module arrays may be vectorized with aligned accesses. With taskloop, however, it is incompatible: I have to use taskloop inside a master section, and there ifort returns this error:
    main.f90(35): error #7631: This statement or directive is not permitted within the body of an OpenMP* MASTER/END MASTER block.
          !dir$ vector aligned
    ------------------^
    compilation aborted for main.f90 (code 1)
Thanks
Eric
Note, when the arrays matrix and matout are allocated aligned, this only assures the compiler that the lowest cell of the entire array is aligned. In other words, alignment is assured only when all indexes of the array are at their lbound. Thus for any slicing up of the array (parallel constructs), the compiler cannot know that the starting point is aligned.
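A small arithmetic check makes this concrete for the sizes used in this thread. The sketch below assumes that double precision occupies 8 bytes and that the array base address itself is 64-byte aligned; the program name column_alignment is hypothetical. It prints the byte offset of the first element of the first few columns, matrix(1,j,1,1), relative to the start of the array.

```fortran
program column_alignment
  implicit none
  integer, parameter :: d1 = 100         ! leading dimension used in the thread
  integer, parameter :: elem_bytes = 8   ! assumed size of double precision
  integer :: col, offset_bytes

  do col = 1, 4
    ! Offset of matrix(1,col,1,1) from the (assumed aligned) array base.
    offset_bytes = (col - 1) * d1 * elem_bytes
    print '(a,i0,a,i0,a,i0)', 'column ', col, ': offset ', offset_bytes, &
          ' bytes, offset mod 64 = ', mod(offset_bytes, 64)
  end do
end program column_alignment
```

Each column starts 800 bytes after the previous one, and 800 leaves a remainder of 32 when divided by 64, so under these assumptions only every other column starts on a 64-byte boundary. An inner loop over k therefore cannot be treated as 64-byte aligned for every j, even when the allocation itself is aligned.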
If you pass a multi-dimensioned array into a parallel region, you might be able to get the loop to vectorize if you can successfully get the collapse to work:
!$OMP TASKLOOP COLLAPSE(4) ...
Though I think you would have better luck using:
!$OMP PARALLEL DO COLLAPSE(4) SCHEDULE(STATIC,SIMD) ...
Jim Dempsey
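For readers who want to try the collapsed worksharing variant, here is a minimal sketch. It is an assumption-laden illustration, not code from the thread: it uses the combined parallel do simd construct so that the collapsed iteration space is both distributed across threads and vectorized, spells the schedule modifier as schedule(simd:static) in the OpenMP 4.5 form, uses local arrays, and inlines the body of arraycomp. Whether a given ifort version accepts the simd modifier should be checked against its documentation. The program name collapse_sketch is hypothetical.

```fortran
program collapse_sketch
  implicit none
  ! Sizes taken from the thread; declared as constants for this sketch only.
  integer, parameter :: d1 = 100, d2 = 100, d3 = 100, d4 = 40
  double precision, allocatable :: matrix(:,:,:,:), matout(:,:,:,:)
  double precision :: corr
  integer :: n, i, j, k

  corr = 2.3d0
  allocate(matrix(d1,d2,d3,d4), matout(d1,d2,d3,d4))
  matrix = 0.0d0

  ! All four loops form one iteration space that is split across threads
  ! and executed in SIMD chunks; the simd modifier is an OpenMP 4.5 feature.
  !$omp parallel do simd collapse(4) schedule(simd:static)
  do n = 1, d4
    do i = 1, d3
      do j = 1, d2
        do k = 1, d1
          matout(k,j,i,n) = matrix(k,j,i,n) + abs(corr)
        end do
      end do
    end do
  end do
  !$omp end parallel do simd

  print *, matout(d1,d2,d3,d4)
end program collapse_sketch
```

Because the loop nest is perfectly nested with constant bounds, it meets the basic requirements for collapse; the alignment caveat above still applies to each thread's starting position.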
