Dear all,
when compiling the following demo code
    program collapse
      implicit none

      real, dimension(1000)   :: data1 = 1.0e0
      real, dimension(100,10) :: data2 = 1.0e0

      integer :: i,j

      !$omp parallel do simd
      do i=1,size(data1,1)
         data1(i) = data1(i) + 1.0e0
      end do
      !$omp end parallel do simd

      !$omp parallel do
      do i=1,size(data1,1)
         data1(i) = data1(i) + 1.0e0
      end do
      !$omp end parallel do

      !$omp parallel do simd collapse(2)
      do j=1,size(data2,2)
         do i=1,size(data2,1)
            data2(i,j) = data2(i,j) + 1.0e0
         end do
      end do
      !$omp end parallel do simd

      !$omp parallel do collapse(2)
      do j=1,size(data2,2)
         do i=1,size(data2,1)
            data2(i,j) = data2(i,j) + 1.0e0
         end do
      end do
      !$omp end parallel do

    end program collapse
with
ifort (IFORT) 14.0.3 20140422
Copyright (C) 1985-2014 Intel Corporation. All rights reserved.
under
Ubuntu 12.04.4 LTS
running on an
Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
I receive the following messages:
$> ifort -openmp -openmp_report2 -vec_report2 collapse.f90 -o collapse
collapse.f90(9): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
collapse.f90(15): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
collapse.f90(21): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
collapse.f90(29): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
collapse.f90(10): (col. 3) remark: OpenMP SIMD LOOP WAS VECTORIZED
collapse.f90(16): (col. 3) remark: LOOP WAS VECTORIZED
collapse.f90(25): (col. 5) remark: loop was not vectorized: statement cannot be vectorized
collapse.f90(22): (col. 3) warning #13379: loop was not vectorized with "simd"
collapse.f90(30): (col. 3) remark: loop was not vectorized: existence of vector dependence
Is there a reason why the loop with !$omp parallel do simd collapse(2) starting at lines 21-22 cannot be parallelized and vectorized, whereas the one at lines 9-10 can? If so, can I add some more directives or clauses to make this happen?
Thank you in advance,
Matthias
That does look suspicious. Let me investigate.
I opened bug report DPD200358268 for this issue.
Thanks for sending this in. I'll report back when a fix is found.
ron
I wouldn't be surprised if the usual parallel outer / vector inner structure (with no simd or collapse) were better, but maybe that's what you're trying to find out.
The loop at line 30 apparently comes down to the same thing: collapse seems to prevent vectorization of the inner loop.
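The "parallel outer, vector inner" structure mentioned above can be sketched roughly like this (an untested sketch of the same update; the inner simd directive is optional, since ifort will normally auto-vectorize a stride-1 inner loop anyway):

```fortran
program outer_inner
  implicit none
  real, dimension(100,10) :: data2 = 1.0e0
  integer :: i, j

  ! Threads divide the outer (j) loop among themselves; each
  ! thread's inner (i) loop is left to the vectorizer.
  !$omp parallel do
  do j = 1, size(data2,2)
     !$omp simd
     do i = 1, size(data2,1)
        data2(i,j) = data2(i,j) + 1.0e0
     end do
  end do
  !$omp end parallel do
end program outer_inner
```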
Tim Prince wrote:
I wouldn't be surprised if the usual parallel outer / vector inner structure (with no simd or collapse) were better, but maybe that's what you're trying to find out.
The loop at line 30 apparently comes down to the same thing: collapse seems to prevent vectorization of the inner loop.
This is exactly what I am trying to find out. In my understanding, 'parallel do simd' should do exactly this, i.e. generate a parallelized outer loop and use SIMD instructions within it. I wanted to use the collapse clause because the outer loop alone is not always guaranteed to be large enough to benefit from parallelization. However, the total iteration count, size(data2,1)*size(data2,2), should be sufficiently large to benefit from combined SIMD parallelization.
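One way to keep the large trip count without the collapse clause (an untested sketch of my own, not a workaround suggested by Intel) is to collapse the nest by hand: since data2 is contiguous, it can be passed to an explicit-shape 1-D dummy argument via sequence association, which reduces the problem to the 1-D case that did vectorize at lines 9-10:

```fortran
program manual_collapse
  implicit none
  real, dimension(100,10) :: data2 = 1.0e0

  ! Sequence association: the contiguous 2-D array is seen by the
  ! subroutine as a 1-D array of size(data2) = 1000 elements.
  call update(data2, size(data2))

contains

  subroutine update(a, n)
    integer, intent(in)    :: n
    real,    intent(inout) :: a(n)
    integer :: k

    ! Same update as before, but over a single collapsed index,
    ! so both parallelization and simd apply to one loop.
    !$omp parallel do simd
    do k = 1, n
       a(k) = a(k) + 1.0e0
    end do
    !$omp end parallel do simd
  end subroutine update

end program manual_collapse
```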
parallel do simd appears not to get the benefit of peeling for alignment, so the possible advantage of a large parallel loop count (making better use of threads) is offset by a loss in effectiveness of vectorization. With an inner loop count of 100, you barely approach fully effective vectorization when using AVX-256. With your array sizes you might expect threaded scaling to drop off beyond 2 threads no matter how you go about it.
My assumption has been that parallel do simd is suited to the case where there isn't an effectively vectorizable inner loop. Even then, recent ifort may do a good job of optimizing when not handicapped by clauses that restrict how it goes about it.
I don't know of any specific threshold in OpenMP such as the one in Cilk(tm) Plus, where the loop count must be 8 times the number of workers in order for optimum automatic chunking to take place. The default static OpenMP scheduling you would get in these examples uses the largest possible chunks.
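The chunking can also be made explicit with a schedule clause. As a hedged sketch (the chunk size of 100 is my assumption, chosen to match the inner extent so each chunk covers whole columns; whether this helps the vectorizer here is untested):

```fortran
program sched_demo
  implicit none
  real, dimension(100,10) :: data2 = 1.0e0
  integer :: i, j

  ! collapse(2) yields a single iteration space of 100*10 = 1000.
  ! The default schedule(static) splits it into the largest possible
  ! chunks (e.g. 250 iterations each on 4 threads), which can cut
  ! across columns; a chunk size equal to the inner extent keeps
  ! each chunk on whole columns of data2.
  !$omp parallel do collapse(2) schedule(static, 100)
  do j = 1, size(data2,2)
     do i = 1, size(data2,1)
        data2(i,j) = data2(i,j) + 1.0e0
     end do
  end do
  !$omp end parallel do
end program sched_demo
```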