- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have a big set of code with OMP4.0 directives (target, simd...)
In one module the compiler throws lot's of warnings about "loops not vectorized with simd" although it should.
I cut the code down to the bare minimum that still produces this behaviour:
SUBROUTINE simdTest
IMPLICIT NONE
INTEGER :: i, j, k, sr, tn,nzb,nzt,nxl,nxr,nys,nyn
REAL :: s1, s2, s3, s4
REAL, DIMENSION(:,:,:), ALLOCATABLE :: u,v,pt,rmask,sums_l
REAL, DIMENSION(:,:), ALLOCATABLE :: usws,vsws,shf
!$omp parallel do schedule(runtime) private(s1,s2,s3)
DO k = nzb, nzt+1
!$omp simd collapse( 2 ) reduction( +: s1, s2, s3 )
DO i = nxl, nxr
DO j = nys, nyn
s1 = s1 + u(k,j,i) * rmask(j,i,sr)
s2 = s2 + v(k,j,i) * rmask(j,i,sr)
s3 = s3 + pt(k,j,i) * rmask(j,i,sr)
ENDDO
ENDDO
sums_l(k,1,tn) = s1
sums_l(k,2,tn) = s2
sums_l(k,4,tn) = s3
ENDDO
!$omp parallel do reduction( +: s1, s2, s3, s4) schedule(runtime)
DO i = nxl, nxr
DO j = nys, nyn
s1 = s1 + usws(j,i) * rmask(j,i,sr)
s2 = s2 + vsws(j,i) * rmask(j,i,sr)
s3 = s3 + shf(j,i) * rmask(j,i,sr)
s4 = s4 + 0.0
ENDDO
ENDDO
sums_l(nzb,12,tn) = s1
sums_l(nzb,14,tn) = s2
sums_l(nzb,16,tn) = s3
END SUBROUTINE
If you compile this with "ifort -openmp -O2" it will warn about the first loop. If you remove literally anything (even from second loop) it will vectorize.
Message from vec-report is "subscript to complex".
Could you explain that? IMO not vectorizing the first loop would lead to significant performance loss.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm afraid this is too much cut down to be a useful example. I'm concerned about the uninitialized reduction variables, unallocated arrays, and do-nothing reduction on s4.
Without the omp simd reduction, the compiler would be influenced by the architecture target in deciding whether to vectorize a sum reduction with a strided operand (and it usually makes a good decision). If you could put the k index last on u,v, pt it would be a lot more favorable to this organization of loops.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The uninitialized variables do not influence the optimization as far as I could see. In the real code, they are globals in a separate module and set before calling the function.
As far as I remember, using the reduction(+: list) clause will initialize the variables in the list with 0.
However here is another version of the code:
SUBROUTINE simdTest
IMPLICIT NONE
INTEGER :: i, j, k, sr, tn,nzb,nzt,nxl,nxr,nys,nyn
REAL :: s1, s2, s3, s4
REAL, DIMENSION(:,:,:), ALLOCATABLE :: u,v,pt,rmask,sums_l,rflags_invers
REAL, DIMENSION(:,:), ALLOCATABLE :: usws,vsws,shf
nzb = 0; nxl = 0; nys = 0; sr = 5; tn = 0
nzt = 100; nxr = 100; nyn = 100
allocate(u(nzt+1,nyn,nxr))
allocate(v(nzt+1,nyn,nxr))
allocate(pt(nzt+1,nyn,nxr))
allocate(rmask(nzt+1,nyn,nxr))
allocate(rflags_invers(nzt+1,nyn,nxr))
allocate(sums_l(nzt+1,20,1))
allocate(usws(nyn,nxr))
allocate(vsws(nyn,nxr))
allocate(shf(nyn,nxr))
!$omp parallel do schedule(runtime) private(s1,s2,s3)
DO k = nzb, nzt+1
s1=0;s2=0;s3=0
!$omp simd collapse( 2 ) reduction( +: s1, s2, s3 )
DO i = nxl, nxr
DO j = nys, nyn
s1 = s1 + u(k,j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
s2 = s2 + v(k,j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
s3 = s3 + pt(k,j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
ENDDO
ENDDO
sums_l(k,1,tn) = s1
sums_l(k,2,tn) = s2
sums_l(k,4,tn) = s3
ENDDO
s1=0;s2=0;s3=0;s4=0
!$omp parallel do reduction( +: s1, s2, s3, s4) schedule(runtime)
DO i = nxl, nxr
DO j = nys, nyn
s1 = s1 + usws(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
s2 = s2 + vsws(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
s3 = s3 + shf(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
s4 = s4 + shf(j,i)
ENDDO
ENDDO
sums_l(nzb,12,tn) = s1
sums_l(nzb,14,tn) = s2
sums_l(nzb,16,tn) = s3
END SUBROUTINE
Please note that the loop order is reasonable with the additional array (it is there in real code) and also changing the s4 reduction does not influence the behavior either.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
At issue here is your index order for the arrays are not favorable to vectorization. Either the loops should be reordered or the dimensions of u, v, pt be reorganized to index as (j,i,k).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Well I cannot change the order of the array dimensions. reordering the loops for u,v,pt will make them worse for rflags_invers
But strangely: If you remove anything (e.g. s4 = s4 + shf(j,i) from 2nd loop) then the first loop gets vectorized. Can you explain that?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
For this loop (with stride indexing) I would look at the disassembly window from the debugger or VTune (not the assembler output as it will not be fully optimized). Merely seeing vectorized or not in the report is insufficient to tell what was done.
>>worse for rflags_invers
rflags_inverse is read once per iteration and used three times. This cost should be less than three strided fetches.
**** alternate suggestion
SUBROUTINE simdTest IMPLICIT NONE INTEGER :: i, j, k, sr, tn,nzb,nzt,nxl,nxr,nys,nyn REAL :: s1, s2, s3, s4 REAL, DIMENSION(:,:,:), ALLOCATABLE :: u,v,pt,rmask,sums_l,rflags_invers REAL, DIMENSION(:,:), ALLOCATABLE :: usws,vsws,shf, k_u, k_v, k_pt ! **** nzb = 0; nxl = 0; nys = 0; sr = 5; tn = 0 nzt = 100; nxr = 100; nyn = 100 allocate(u(nzt+1,nyn,nxr)) allocate(v(nzt+1,nyn,nxr)) allocate(pt(nzt+1,nyn,nxr)) ! vvvvvvv allocate(k_u(nyn,nxr)) ! **************** allocate(k_v(nyn,nxr)) ! **************** allocate(k_pt(nyn,nxr)) ! **************** ! ^^^^^^^ allocate(rmask(nzt+1,nyn,nxr)) allocate(rflags_invers(nzt+1,nyn,nxr)) allocate(sums_l(nzt+1,20,1)) allocate(usws(nyn,nxr)) allocate(vsws(nyn,nxr)) allocate(shf(nyn,nxr)) !$omp parallel do schedule(runtime) private(s1,s2,s3) DO k = nzb, nzt+1 s1=0;s2=0;s3=0 k_u = u(k,:,:) ! **************** k_v = v(k,:,:) ! **************** k_pt = pt(k,:,:) ! **************** !$omp simd collapse( 2 ) reduction( +: s1, s2, s3 ) DO i = nxl, nxr DO j = nys, nyn s1 = s1 + k_u(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1) ! **************** s2 = s2 + k_v(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1) ! **************** s3 = s3 + k_pt(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1) ! **************** ENDDO ENDDO sums_l(k,1,tn) = s1 sums_l(k,2,tn) = s2 sums_l(k,4,tn) = s3 ENDDO s1=0;s2=0;s3=0;s4=0 !$omp parallel do reduction( +: s1, s2, s3, s4) schedule(runtime) DO i = nxl, nxr DO j = nys, nyn s1 = s1 + usws(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1) s2 = s2 + vsws(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1) s3 = s3 + shf(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1) s4 = s4 + shf(j,i) ENDDO ENDDO sums_l(nzb,12,tn) = s1 sums_l(nzb,14,tn) = s2 sums_l(nzb,16,tn) = s3 END SUBROUTINE Jim Dempsey
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page