<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic The OpenMP standard has a lot in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117255#M7494</link>
    <description>&lt;P&gt;The OpenMP standard has a lot of restrictions on what is allowed in a loop targeted by a SIMD pragma. One restriction that might be relevant here is that the loop cannot contain any branches to outside the loop. I would guess that the function call is considered to be a branch.&lt;/P&gt;

&lt;P&gt;Manually inlining the "arraycomp" function should enable vectorization.&lt;/P&gt;</description>
    <pubDate>Fri, 16 Dec 2016 15:41:27 GMT</pubDate>
    <dc:creator>McCalpinJohn</dc:creator>
    <dc:date>2016-12-16T15:41:27Z</dc:date>
    <item>
      <title>nested loop vectorization in OpenMP taskloop</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117254#M7493</link>
      <description>&lt;P&gt;Hi everybody,&lt;/P&gt;

&lt;P&gt;I have a simple program with four nested loops. The outermost loop is parallelized with the OpenMP taskloop directive, and I am trying to vectorize the innermost loop.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program main

use modf
use omp_lib

implicit none

integer :: n,i,j,k
integer :: d1,d2,d3,d4
double precision :: corr
double precision :: time

d1 = 100
d2 = 100
d3 = 100
d4 = 40
corr = 2.3

!$omp parallel
!$omp master

allocate(matrix(d1,d2,d3,d4))
allocate(matout(d1,d2,d3,d4))

matrix(:,:,:,:) = 0.0

time = omp_get_wtime()
!$omp taskloop default(none) firstprivate(d1,d2,d3,d4,corr) shared(matrix,matout)
DO n=1,d4

    DO i=1,d3
        DO j=1,d2

            !$omp simd aligned(matrix,matout:64)
            DO k=1,d1
                matout(k,j,i,n) = arraycomp(matrix(k,j,i,n),corr)
            ENDDO
            !$omp end simd

        ENDDO
    ENDDO

ENDDO
!$omp end taskloop

time = omp_get_wtime() - time

!$omp end master
!$omp end parallel


print*,time 

end program main&lt;/PRE&gt;

&lt;P&gt;where the arraycomp function is contained in this module:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;module modf

double precision,allocatable,dimension(:,:,:,:) :: matrix
double precision,allocatable,dimension(:,:,:,:) :: matout

contains

function arraycomp(in1,in2) result(output)
    !$omp declare simd(arraycomp)
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp

end module&lt;/PRE&gt;

&lt;P&gt;and the code is compiled with this Makefile (ifort 17.0.1):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;test.xx : *.f90
		ifort -O3 -g -xAVX -qopenmp -qopt-report5 -align array64byte $^ -o $@&lt;/PRE&gt;

&lt;P&gt;My problem is that the compiler fails to vectorize the innermost loop, and the optimization report shows the following remarks:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;LOOP BEGIN at main.f90(28,7)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at main.f90(31,5)
      remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification

      LOOP BEGIN at main.f90(32,9)
         remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification

         LOOP BEGIN at main.f90(34,19)
            remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
         LOOP END
      LOOP END
   LOOP END
LOOP END&lt;/PRE&gt;

&lt;P&gt;This is probably caused by the loop variable being assigned at run time inside the task.&lt;/P&gt;

&lt;P&gt;Is there a way to avoid this behaviour and vectorize the innermost loop correctly?&lt;/P&gt;

&lt;P&gt;Thanks for your attention.&lt;/P&gt;

&lt;P&gt;Best regards&lt;/P&gt;

&lt;P&gt;Eric&lt;/P&gt;</description>
      <pubDate>Fri, 16 Dec 2016 13:43:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117254#M7493</guid>
      <dc:creator>eric_p_</dc:creator>
      <dc:date>2016-12-16T13:43:04Z</dc:date>
    </item>
    <item>
      <title>The OpenMP standard has a lot</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117255#M7494</link>
      <description>&lt;P&gt;The OpenMP standard has a lot of restrictions on what is allowed in a loop targeted by a SIMD pragma. One restriction that might be relevant here is that the loop cannot contain any branches to outside the loop. I would guess that the function call is considered to be a branch.&lt;/P&gt;

&lt;P&gt;Manually inlining the "arraycomp" function should enable vectorization.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Dec 2016 15:41:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117255#M7494</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-12-16T15:41:27Z</dc:date>
    </item>
    <item>
      <title>Thanks for your reply. You're</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117256#M7495</link>
      <description>&lt;P&gt;Thanks for your reply. You're right. With manual inlining the loop was vectorized:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt; LOOP BEGIN at main.f90(35,13)
            remark #15388: vectorization support: reference at (36:17) has aligned access   [ main.f90(36,17) ]
            remark #15389: vectorization support: reference at (36:36) has unaligned access   [ main.f90(36,36) ]
            remark #15381: vectorization support: unaligned access used inside loop body
            remark #15305: vectorization support: vector length 4
            remark #15399: vectorization support: unroll factor set to 4
            remark #15309: vectorization support: normalized vectorization overhead 0.364
            remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
            remark #15442: entire loop may be executed in remainder
            remark #15449: unmasked aligned unit stride stores: 1 
            remark #15450: unmasked unaligned unit stride loads: 1 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 10 
            remark #15477: vector cost: 2.750 
            remark #15478: estimated potential speedup: 3.230 
            remark #15488: --- end vector cost summary ---
LOOP END&lt;/PRE&gt;
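
&lt;P&gt;For anyone reading later, a minimal sketch of the manually inlined version could look like the following. It assumes the body of arraycomp is simply substituted into the k loop; the program name and the fixed-size parameters are only for illustration and are not the exact code that produced the report above.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program inline_sketch

implicit none

! Sizes taken from the original post; declared as constants here for brevity
integer, parameter :: d1 = 100, d2 = 100, d3 = 100, d4 = 40
double precision, allocatable, dimension(:,:,:,:) :: matrix, matout
double precision :: corr
integer :: n,i,j,k

allocate(matrix(d1,d2,d3,d4))
allocate(matout(d1,d2,d3,d4))

matrix = 0.0d0
corr = 2.3d0

!$omp parallel
!$omp master
!$omp taskloop firstprivate(corr) shared(matrix,matout)
DO n=1,d4
    DO i=1,d3
        DO j=1,d2
            !$omp simd
            DO k=1,d1
                ! body of arraycomp substituted by hand: no call left in the loop
                matout(k,j,i,n) = matrix(k,j,i,n) + abs(corr)
            ENDDO
        ENDDO
    ENDDO
ENDDO
!$omp end taskloop
!$omp end master
!$omp end parallel

! every element should equal 0.0 + abs(2.3) = 2.3
print*, matout(5,5,5,5)

end program inline_sketch&lt;/PRE&gt;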

&lt;P&gt;However, I think the behaviour is due to taskloop and not to OpenMP simd, because when I use &lt;STRONG&gt;openmp do&lt;/STRONG&gt; instead of &lt;STRONG&gt;openmp taskloop&lt;/STRONG&gt; the code is vectorized without manual inlining:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;         LOOP BEGIN at main.f90(35,13)
            remark #15389: vectorization support: reference matrix_(k,j,i,n) has unaligned access   [ main.f90(36,45) ]
            remark #15389: vectorization support: reference matout_(k,j,i,n) has unaligned access   [ main.f90(36,17) ]
            remark #15381: vectorization support: unaligned access used inside loop body
            remark #15305: vectorization support: vector length 2
            remark #15399: vectorization support: unroll factor set to 4
            remark #15309: vectorization support: normalized vectorization overhead 0.052
            remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
            remark #15451: unmasked unaligned unit stride stores: 2 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 124 
            remark #15477: vector cost: 70.000 
            remark #15478: estimated potential speedup: 1.750 
            remark #15484: vector function calls: 1 
            remark #15488: --- end vector cost summary ---
            remark #15489: --- begin vector function matching report ---
            remark #15490: Function call: ARRAYCOMP with simdlen=2, actual parameter types: (vector,uniform)   [ main.f90(36,35) ]
            remark #15492: A suitable vector variant was found (out of 2) with xmm, simdlen=2, unmasked, formal parameter types: (vector,vector)
            remark #15493: --- end vector function matching report ---
         LOOP END&lt;/PRE&gt;
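
&lt;P&gt;The function matching report above shows the call passing (vector,uniform) actual arguments to a variant with (vector,vector) formal parameters. As an untested sketch, and assuming it is acceptable to change the dummies to intent(in), a uniform(in2) clause on the declare simd directive would describe the call more closely. The uniform clause, the module name modf_sketch and the small driver program are assumptions for illustration and are not in the original code.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;module modf_sketch

implicit none

contains

function arraycomp(in1,in2) result(output)
    ! in2 has the same value for every SIMD lane, so declare it uniform
    !$omp declare simd(arraycomp) uniform(in2)
    double precision, intent(in) :: in1,in2
    double precision :: output
    output = in1 + abs(in2)
end function arraycomp

end module modf_sketch

program uniform_sketch

use modf_sketch

implicit none

integer, parameter :: d1 = 100
double precision :: a(d1), b(d1), corr
integer :: k

a = 1.0d0
corr = -2.5d0

!$omp simd
DO k=1,d1
    b(k) = arraycomp(a(k),corr)
ENDDO

! both values should be 1.0 + abs(-2.5) = 3.5
print*, b(1), b(d1)

end program uniform_sketch&lt;/PRE&gt;

&lt;P&gt;Whether this changes the variant the compiler selects depends on the compiler version; the computed result is the same either way.&lt;/P&gt;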

&lt;P&gt;Naturally I prefer the first approach because the speedup is higher!&lt;/P&gt;

&lt;P&gt;Thanks again&lt;/P&gt;

&lt;P&gt;Eric&lt;/P&gt;</description>
      <pubDate>Fri, 16 Dec 2016 16:11:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117256#M7495</guid>
      <dc:creator>eric_p_</dc:creator>
      <dc:date>2016-12-16T16:11:15Z</dc:date>
    </item>
    <item>
      <title>In your modf, you have not</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117257#M7496</link>
      <description>&lt;P&gt;In your modf, you have not attributed the arrays as being aligned. Therefore the allocation is not required to be aligned.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!dir$ attributes align: 64:: matrix
double precision,allocatable,dimension(:,:,:,:) :: matrix
!dir$ attributes align: 64:: matout
double precision,allocatable,dimension(:,:,:,:) :: matout
&lt;/PRE&gt;

&lt;P&gt;Then specify the function as a vector function&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;function arraycomp(in1,in2) result(output)
!dir$ attributes vector :: arraycomp
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp
&lt;/PRE&gt;

&lt;P&gt;Or target a specific processor architecture&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;function arraycomp(in1,in2) result(output)
!dir$ attributes vector : processor(core_4th_gen_avx) :: arraycomp
!... !dir$ attributes vector : processor(mic_avx512 ) :: arraycomp

    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp
&lt;/PRE&gt;

&lt;P&gt;Then remove the !$omp simd/end simd directives.&lt;/P&gt;

&lt;P&gt;Note, your inner loop k (the one that can be vectorized) is .NOT. an OpenMP sliceable DO loop index. Ergo, !$omp simd of this loop index variable is nonsensical.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 16 Dec 2016 16:27:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117257#M7496</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-12-16T16:27:20Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117258#M7497</link>
      <description>&lt;P&gt;Hi Jim, thank you for your reply.&lt;/P&gt;

&lt;P&gt;I edited modf.f90 following your suggestion and also tried adding a directive to force inlining:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;module modf

!dir$ attributes align: 64:: matrix
double precision,allocatable,dimension(:,:,:,:) :: matrix
!dir$ attributes align: 64:: matout
double precision,allocatable,dimension(:,:,:,:) :: matout

contains

!DEC$ ATTRIBUTES FORCEINLINE :: arraycomp
function arraycomp(in1,in2) result(output)
    !dir$ attributes vector :: arraycomp
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp

end module&lt;/PRE&gt;

&lt;P&gt;but the loop in main.f90 is still not vectorized. The only ways that seem to work are manual inlining or putting the function into the main program with &lt;STRONG&gt;contains&lt;/STRONG&gt;:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program main

use modf
use omp_lib

implicit none

integer :: n,i,j,k
integer :: d1,d2,d3,d4
double precision :: corr
double precision :: time

d1 = 100
d2 = 100
d3 = 100
d4 = 40
corr = 2.3

!$omp parallel
!$omp master

allocate(matrix(d1,d2,d3,d4))
allocate(matout(d1,d2,d3,d4))

matrix(:,:,:,:) = 0.0

time = omp_get_wtime()
!$omp taskloop default(none) firstprivate(d1,d2,d3,d4,corr) shared(matrix,matout)
DO n=1,d4

    DO i=1,d3
        DO j=1,d2

            DO k=1,d1          
                matout(k,j,i,n) = arraycomp(matrix(k,j,i,n),corr)
            ENDDO
          

        ENDDO
    ENDDO

ENDDO
!$omp end taskloop

time = omp_get_wtime() - time

!$omp end master
!$omp end parallel


print*,time !,matout(5,5,5,5)

contains

function arraycomp(in1,in2) result(output)
    double precision, intent(inout) :: in1,in2
    double precision:: output
    output = (in1 + abs(in2))
end function arraycomp


end program main&lt;/PRE&gt;

&lt;P&gt;Reading these two articles on the Intel site:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization"&gt;https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization"&gt;https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;it seems that the only way to align an array that is declared in one module and allocated in another is to use the compiler flag &lt;STRONG&gt;-align array64byte&lt;/STRONG&gt;.&lt;/P&gt;

&lt;P&gt;From the examples, I understand that the directive &lt;STRONG&gt;!dir$ vector aligned&lt;/STRONG&gt; can tell the compiler that a loop working on module arrays can be vectorized with aligned accesses. However, it is incompatible with taskloop here, because taskloop has to be used inside a master region and ifort then returns this error:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;main.f90(35): error #7631: This statement or directive is not permitted within the body of an OpenMP* MASTER/END MASTER
block.
            !dir$ vector aligned
------------------^
compilation aborted for main.f90 (code 1)&lt;/PRE&gt;

&lt;P&gt;Thanks&lt;/P&gt;

&lt;P&gt;Eric&lt;/P&gt;</description>
      <pubDate>Mon, 19 Dec 2016 08:53:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117258#M7497</guid>
      <dc:creator>eric_p_</dc:creator>
      <dc:date>2016-12-19T08:53:00Z</dc:date>
    </item>
    <item>
      <title>Note, when the arrays matrix</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117259#M7498</link>
      <description>&lt;P&gt;Note that when the arrays matrix and matout are allocated with alignment, this only assures the compiler that the lowest cell of the entire array is aligned. In other words, an element is assured to be aligned only when all of its indexes are at their lower bounds (lbound). Thus, for any slicing of the array (as done by parallel constructs), the compiler cannot know that the starting point is aligned.&lt;/P&gt;
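
&lt;P&gt;As a rough illustration of why a slice may not start on an aligned address, the byte offset of the first element of each column can be checked with simple arithmetic. This sketch assumes 8-byte double precision elements and a contiguous allocation whose first element is 64-byte aligned; the program and variable names are made up for the example.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program column_alignment

implicit none

integer, parameter :: d1 = 100            ! leading dimension used in this thread
integer, parameter :: bytes_per_elem = 8  ! assumed size of double precision
integer, parameter :: boundary = 64       ! requested alignment in bytes
integer :: j, offset

DO j=1,4
    ! byte offset of element (1,j,1,1) from the aligned start of the array
    offset = (j-1) * d1 * bytes_per_elem
    print*, 'column', j, 'offset mod 64 =', mod(offset,boundary)
ENDDO

end program column_alignment&lt;/PRE&gt;

&lt;P&gt;With d1 = 100 the remainders alternate 0, 32, 0, 32, so only every other column starts on a 64-byte boundary. Under the same assumptions, a leading dimension that is a multiple of 8 elements would make every column start aligned.&lt;/P&gt;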

&lt;P&gt;If you pass a multi-dimensioned array into a parallel region, you might be able to get the loop to vectorize if you can successfully get the collapse to work:&lt;/P&gt;

&lt;P&gt;!$OMP TASKLOOP COLLAPSE(4) ...&lt;/P&gt;

&lt;P&gt;Though I think you would have better luck using:&lt;/P&gt;

&lt;P&gt;!$OMP PARALLEL DO COLLAPSE(4) SCHEDULE(SIMD:STATIC) ...&lt;/P&gt;
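
&lt;P&gt;A minimal, untested sketch of that second form follows. It assumes the body of arraycomp is written inline, uses the combined PARALLEL DO SIMD construct, and spells the schedule modifier as schedule(simd:static), which is the OpenMP 4.5 syntax. The program name and constant sizes are illustrative only.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program collapse_sketch

use omp_lib

implicit none

integer, parameter :: d1 = 100, d2 = 100, d3 = 100, d4 = 40
double precision, allocatable, dimension(:,:,:,:) :: matrix, matout
double precision :: corr, time
integer :: n,i,j,k

allocate(matrix(d1,d2,d3,d4))
allocate(matout(d1,d2,d3,d4))

matrix = 0.0d0
corr = 2.3d0

time = omp_get_wtime()
! all four loops form one iteration space, split across threads and SIMD lanes
!$omp parallel do simd collapse(4) schedule(simd:static)
DO n=1,d4
    DO i=1,d3
        DO j=1,d2
            DO k=1,d1
                matout(k,j,i,n) = matrix(k,j,i,n) + abs(corr)
            ENDDO
        ENDDO
    ENDDO
ENDDO
!$omp end parallel do simd
time = omp_get_wtime() - time

! every element should equal 2.3
print*, time, matout(5,5,5,5)

end program collapse_sketch&lt;/PRE&gt;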

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 19 Dec 2016 14:01:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/nested-loop-vectorization-in-OpenMP-taskloop/m-p/1117259#M7498</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-12-19T14:01:30Z</dc:date>
    </item>
  </channel>
</rss>

