I have made a small test case out of a bigger program where the compiler fails to vectorize a loop due to an "assumed dependency", even though the loop simd construct is used:
program fortTest
  integer, allocatable :: vals(:)
  integer, allocatable :: vals2(:)
  integer, allocatable :: send(:)
  integer i, j, ct, tmp, tmp2, tmp3
  ct = 10000
  allocate(vals(ct*ct))
  allocate(vals2(ct*ct))
  allocate(send(ct))
  do i=1, ct
    send(i) = i
    do j=1, ct
      vals2(i*ct+j) = i+1+j
    end do
  end do
  !$omp parallel do simd private(tmp,tmp2,tmp3) schedule(runtime)
  do i=1, ct
    tmp  = vals2(i) * 2
    tmp2 = vals2(i+ct) * 2
    tmp3 = vals2(i+2*ct) * 2
    vals(send(i)) = tmp+tmp2
    vals(send(i)+i*ct) = tmp+tmp3
  end do
end
Please note that once you remove the "schedule(runtime)" clause, the loop gets vectorized and my (big) program receives a 4x speedup.
With parallel do simd you're asking the compiler to divide the loop into chunks large enough to apply vectorization within each chunk. It may not be surprising if it knows how to do this only with the default static schedule. There's nothing in this example to indicate why that would not be optimal.
Your scatter storage of the results presumably requires those to be stored in effect by scalar instructions, e.g. extractps if using SSE4. So the usefulness of vectorization would be limited, although it may help. A 4x gain from vectorization would be remarkable even on AVX2 or MIC.
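The point about scatter storage can be sketched like this (illustrative pseudocode for one hypothetical 4-wide SIMD chunk, not what any particular compiler actually emits):

```fortran
! The contiguous loads and the arithmetic vectorize cleanly:
!
!   vt(1:4)  = vals2(i:i+3)       * 2   ! one vector load + multiply
!   vt2(1:4) = vals2(i+ct:i+3+ct) * 2   ! one vector load + multiply
!
! but the stores are indirect through send(i), so each lane must be
! extracted and written by a scalar instruction (e.g. extractps on SSE4):
!
!   vals(send(i))   = vt(1) + vt2(1)
!   vals(send(i+1)) = vt(2) + vt2(2)
!   vals(send(i+2)) = vt(3) + vt2(3)
!   vals(send(i+3)) = vt(4) + vt2(4)
```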
Before we can address the key issue, you've got a bug in your initialization of the vals2 array:
do i=1, ct
  send(i) = i
  do j=1, ct
    vals2(i*ct+j) = i+1+j
  end do
end do
vals2 is dimensioned ct*ct (100000000), so when i .eq. ct, you're trying to write to vals2(100000000+1), etc.
Patrick
@Patrick: Right, that code was a quick-and-dirty approach to get the compiler behavior that I wanted to demonstrate. That this would fail at runtime is OK, although it should be "do i=0, ct-1" in both loops, with access to send(i+1), then.
Still, it doesn't matter, as the behavior is the same.
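For completeness, a runnable sketch of the corrected initialization described above (zero-based outer loop with send(i+1)); ct is scaled down from 10000 here so the check runs quickly, and the bounds check and final print are my additions:

```fortran
program initFix
  integer, allocatable :: vals2(:), send(:)
  integer i, j, ct
  ct = 100                          ! scaled down from 10000 for a quick check
  allocate(vals2(ct*ct))
  allocate(send(ct))
  do i = 0, ct-1                    ! zero-based, as suggested above
    send(i+1) = i+1
    do j = 1, ct
      if (i*ct+j > ct*ct) stop 'out of bounds'
      vals2(i*ct+j) = i+1+j         ! max index is (ct-1)*ct+ct = ct*ct
    end do
  end do
  print *, 'OK'
end program initFix
```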
@Tim: What I expect the compiler to do here is to divide the loop into chunks (e.g. 4 for SSE) and vectorize those. The remaining iterations should then be parallelized and scheduled with the given scheduling method (in this case decided at runtime).
The whole purpose of loop simd is that one does not have to manually split the loop to extract a block that can get vectorized, like:
do i=1, max, VectorSize
  do j=i, i+VectorSize-1
    ! ...
At least this is what I expect it to be.
The spec states "Specifies a loop that can be executed concurrently using SIMD instructions, and that those iterations will also be executed in parallel..."
What the compiler instead reports is: "This loop may or may not have vector dependencies. I won't vectorize it." But the whole purpose of "omp simd" is to tell the compiler that there are NO dependencies and that it can vectorize the loop.
Compare this with omitting the "parallel do": if you have no "omp simd", the compiler shows the exact same behavior as it (wrongly) shows now. But with "omp simd" it happily vectorizes the loop.
I did try to omit both simd and schedule(runtime) and got the same result as with both.
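Summarizing the variants compared in this thread (outcomes as reported here for the 14.0 compiler; the annotations are mine):

```fortran
!$omp parallel do simd private(tmp,tmp2,tmp3)                     ! vectorized (4x speedup reported)
!$omp parallel do simd private(tmp,tmp2,tmp3) schedule(runtime)   ! NOT vectorized: "assumed dependency"
!$omp parallel do private(tmp,tmp2,tmp3)                          ! not vectorized either
```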
Yes, the gain is remarkable, but it is like that, and it also produces correct results.
The schedule(runtime) clause adds a constraint that inhibits certain loop optimizations in the 14.0 compiler, defeating OpenMP SIMD loop vectorization.
However, the 15.0 compiler has figured this out. U515099.f90 is your original, unmodified (non-runnable) example:
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.040 Beta Build 20140428
$ ifort -openmp -vec-report -opt-report-stdout U515099.f90 |grep LOOP -
LOOP BEGIN at U515099.f90(13,3)
LOOP BEGIN at U515099.f90(15,5)
LOOP END
LOOP BEGIN at U515099.f90(15,5)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at U515099.f90(15,5)
LOOP END
LOOP END
LOOP BEGIN at U515099.f90(21,3)
LOOP END
LOOP BEGIN at U515099.f90(21,3)
remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at U515099.f90(21,3)
LOOP END
LOOP BEGIN at U515099.f90(21,3)
LOOP END
$
Patrick