Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OpenMP parallel do simd collapse problem

Matthias_M_
Beginner
851 Views

Dear all,

when compiling the following demo code

program collapse

  implicit none

  real, dimension(1000)   :: data1 = 1.0e0
  real, dimension(100,10) :: data2 = 1.0e0
  integer :: i,j

  !$omp parallel do simd
  do i=1,size(data1,1)
    data1(i) = data1(i) + 1.0e0
  end do
  !$omp end parallel do simd

  !$omp parallel do
  do i=1,size(data1,1)
    data1(i) = data1(i) + 1.0e0
  end do
  !$omp end parallel do
  
  !$omp parallel do simd collapse(2)
  do j=1,size(data2,2)
    do i=1,size(data2,1)
      data2(i,j) = data2(i,j) + 1.0e0
    end do
  end do
  !$omp end parallel do simd
  
  !$omp parallel do collapse(2)
  do j=1,size(data2,2)
    do i=1,size(data2,1)
      data2(i,j) = data2(i,j) + 1.0e0
    end do
  end do
  !$omp end parallel do

end program collapse

with

ifort (IFORT) 14.0.3 20140422

Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.


under

Ubuntu 12.04.4 LTS

running on a 

Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz


I receive the following messages:

$> ifort -openmp -openmp_report2 -vec_report2 collapse.f90 -o collapse

collapse.f90(9): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

collapse.f90(15): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

collapse.f90(21): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

collapse.f90(29): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

collapse.f90(10): (col. 3) remark: OpenMP SIMD LOOP WAS VECTORIZED

collapse.f90(16): (col. 3) remark: LOOP WAS VECTORIZED

collapse.f90(25): (col. 5) remark: loop was not vectorized: statement cannot be vectorized

collapse.f90(22): (col. 3) warning #13379: loop was not vectorized with "simd"

collapse.f90(30): (col. 3) remark: loop was not vectorized: existence of vector dependence


Is there a reason why the loop (!$omp parallel do simd collapse(2)) starting at lines 21-22 cannot be both parallelized and vectorized, whereas the one at lines 9-10 can? If so, can I add some further directives to make this happen?


Thank you in advance,

Matthias

5 Replies
Ron_Green
Moderator

That does look suspicious. Let me investigate.

Ron_Green
Moderator

I opened a bug report DPD200358268

Thanks for sending this in. I'll report back when a fix is found.

ron

TimP
Honored Contributor III

I wouldn't be surprised if the usual parallel-outer/vector-inner scheme were better (with no simd or collapse), but maybe that's what you're trying to find out.

The loop at line 30 apparently comes down to the same thing, where collapse seems to prevent vectorization of the inner loop.
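The parallel-outer/vector-inner structure would look something like this (a hypothetical rework of the collapse(2) loop from the original post, reusing its declarations; not tested against that compiler version):

```fortran
! Thread the outer loop; leave the inner loop for the vectorizer.
! private(i) is predetermined for Fortran do-variables, but made
! explicit here for clarity.
!$omp parallel do private(i)
do j = 1, size(data2, 2)
  do i = 1, size(data2, 1)
    data2(i,j) = data2(i,j) + 1.0e0
  end do
end do
!$omp end parallel do
```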

Matthias_M_
Beginner

Tim Prince wrote:

I wouldn't be surprised if the usual parallel-outer/vector-inner scheme were better (with no simd or collapse), but maybe that's what you're trying to find out.

The loop at line 30 apparently comes down to the same thing, where collapse seems to prevent vectorization of the inner loop.

This is exactly what I am trying to find out. In my understanding, 'parallel do simd' should do exactly this, i.e. generate an outer parallelized loop and use SIMD instructions within it. I wanted to use the collapse clause because it is not always guaranteed that the outer loop alone is large enough to benefit from parallelization. However, the total iteration count, size(data2,1)*size(data2,2), should be sufficiently large to benefit from combined SIMD parallelization.
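My mental model of collapse(2) is roughly the following manual loop fusion (a conceptual sketch only, not necessarily how the compiler implements it; it reuses the declarations from my original program plus an extra index k):

```fortran
! Conceptual equivalent of collapse(2): one fused iteration space of
! size(data2,1)*size(data2,2) = 1000 iterations, with the original
! indices recomputed from the fused index k.
!$omp parallel do simd private(i,j)
do k = 1, size(data2,1)*size(data2,2)
  j = (k-1)/size(data2,1) + 1
  i = k - (j-1)*size(data2,1)
  data2(i,j) = data2(i,j) + 1.0e0
end do
!$omp end parallel do simd
```

I could imagine that the index recomputation inside the fused loop is part of what makes vectorization harder.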

TimP
Honored Contributor III

parallel do simd appears not to get the benefit of peeling for alignment, so the possible advantage of a larger parallel loop count in making better use of threads is offset by a loss in effectiveness of vectorization. With an inner loop count of 100, you barely approach fully effective vectorization when using AVX-256. With array sizes like yours, you might expect threaded scaling to drop off beyond 2 threads no matter how you go about it.

My assumption has been that parallel do simd is suited for the case where there isn't an effectively vectorizable inner loop.  Even then, recent ifort may do a good job of optimization when not handicapped by clauses restricting how it goes about it.

I don't know any specific threshold in OpenMP such as the one in Cilk(tm) Plus where the loop count must be 8 times the number of workers in order for optimum automatic chunking to take place.  Default static OpenMP scheduling such as you would get in the examples uses the largest possible chunks.
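If explicit chunking is wanted, the schedule clause can override that default (the chunk size of 100 below is an arbitrary illustration, not a tuned value):

```fortran
! Static scheduling with an explicit chunk size on the collapsed
! iteration space; the chunk of 100 is chosen arbitrarily here.
!$omp parallel do collapse(2) schedule(static, 100)
do j = 1, size(data2, 2)
  do i = 1, size(data2, 1)
    data2(i,j) = data2(i,j) + 1.0e0
  end do
end do
!$omp end parallel do
```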
