Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OMP simd gets ignored in omp loop simd schedule(runtime)

Grund__Aalexander

I have reduced a larger program to a small test case in which the compiler fails to vectorize a loop due to an "assumed dependency", even though the loop carries a simd directive:

program fortTest
  integer, allocatable :: vals(:)
  integer, allocatable :: vals2(:)
  integer, allocatable :: send(:)
  integer i,j,ct,tmp,tmp2,tmp3
  
  
  ct=10000
  allocate(vals(ct*ct))
  allocate(vals2(ct*ct))
  allocate(send(ct))

  do i=1, ct
    send(i)=i
    do j=1, ct
      vals2(i*ct+j)=i+1+j
    end do
  end do
  
  !$omp parallel do simd private(tmp,tmp2,tmp3) schedule(runtime)
  do i=1, ct
    tmp = vals2(i) * 2
    tmp2 = vals2(i+ct) * 2
    tmp3 = vals2(i+2*ct) * 2
    vals(send(i)) = tmp+tmp2
    vals(send(i)+i*ct) = tmp+tmp3
  end do

end

Please note that once you remove the "schedule(runtime)" clause, the loop gets vectorized and my (big) program sees a 4x speedup.

TimP
Honored Contributor III

With parallel do simd you're asking the compiler to divide the loop into chunks large enough to apply vectorization within each chunk. It may not be surprising if it knows how to do this only with the default static schedule. There's nothing in this example to indicate why that would not be optimum.

Your scatter storage of the results presumably requires them to be stored, in effect, by scalar instructions, e.g. extractps if using SSE4. So the usefulness of vectorization would be limited, although it may help. A 4x gain from vectorization would be remarkable even on AVX2 or MIC.

pbkenned1
Employee

Before we can address the key issue, you've got a bug in your initialization of the vals2 array:

  do i=1, ct
    send(i)=i
    do j=1, ct
      vals2(i*ct+j)=i+1+j
    end do
  end do

vals2 is dimensioned ct*ct (100000000), so when i .eq. ct, you're trying to write to vals2(100000000+1), etc.
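For reference, a sketch of one possible in-bounds initialization (assuming the intent is for "row" i to occupy elements (i-1)*ct+1 through i*ct of vals2; the intended indexing is yours to confirm):

```fortran
  do i = 1, ct
    send(i) = i
    do j = 1, ct
      ! (i-1)*ct + j runs from 1 to ct*ct, staying within bounds
      vals2((i-1)*ct + j) = i + 1 + j
    end do
  end do
```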

Patrick

Grund__Aalexander

@Patrick: Right, that code was a quick-and-dirty way to reproduce the compiler behavior I wanted to demonstrate. That it would fail at runtime is OK, although it should be "do i=0, ct-1" in both loops, with the access then being send(i+1).

Still, it doesn't matter, as the compiler behavior is the same.

@Tim: What I expect the compiler to do here is to divide the loop into chunks (e.g. of 4 iterations for SSE) and vectorize those. The resulting chunk loop should then be parallelized and scheduled with the given scheduling method (in this case decided at runtime).
The whole purpose of the combined construct is that one does not have to split the loop manually to extract a block that can be vectorized, like:

do i=1,max,VectorSize
  do j=i,i+VectorSize-1
    ! ... vectorizable loop body ...
  end do
end do

At least this is what I expect it to be.
The spec states "Specifies a loop that can be executed concurrently using SIMD instructions, and that those iterations will also be executed in parallel..."
What the compiler instead says is: "This loop may or may not have vector dependencies. I won't vectorize it." But the whole purpose of "omp simd" is to tell the compiler that there are NO dependencies and that it can vectorize the loop.
Compare this with omitting the "parallel do": without "omp simd" the compiler shows exactly the same behavior as it (wrongly) shows now, but with "omp simd" it happily vectorizes the loop.

I did try to omit both simd and schedule(runtime) and got the same result as with both present.

Yes, the gain is remarkable, but that is what I measure, and the program also produces correct results.

pbkenned1
Employee

The schedule(runtime) clause adds a constraint that inhibits certain loop optimizations in the 14.0 compiler, defeating OpenMP SIMD loop vectorization. 
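For 14.0, one possible workaround (an untested sketch; vlen is a hypothetical compile-time chunk size, and the placeholder comment stands in for your original loop body) is to split the combined construct so that simd applies to a fixed-size inner loop while schedule(runtime) stays on the outer parallel loop:

```fortran
  integer, parameter :: vlen = 8   ! hypothetical SIMD chunk size
  integer :: iblk

  !$omp parallel do schedule(runtime) private(i, tmp, tmp2, tmp3)
  do iblk = 1, ct, vlen
    !$omp simd
    do i = iblk, min(iblk + vlen - 1, ct)
      ! ... original loop body ...
    end do
  end do
```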

However, the 15.0 compiler has figured this out. U515099.f90 is your original, unmodified (non-runnable) example:

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.040 Beta Build 20140428


$ ifort -openmp -vec-report -opt-report-stdout U515099.f90 |grep LOOP -
LOOP BEGIN at U515099.f90(13,3)
   LOOP BEGIN at U515099.f90(15,5)
   LOOP END
   LOOP BEGIN at U515099.f90(15,5)
      remark #15300: LOOP WAS VECTORIZED
   LOOP END
   LOOP BEGIN at U515099.f90(15,5)
   LOOP END
LOOP END
LOOP BEGIN at U515099.f90(21,3)
LOOP END
LOOP BEGIN at U515099.f90(21,3)
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at U515099.f90(21,3)
LOOP END
LOOP BEGIN at U515099.f90(21,3)
LOOP END
$


Patrick
