- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear Intel users, I'm using ifort 13.1.3 under Linux.
I would like to vectorize the inner loop but I'm not sure that the inner loop is a good candidate. This is the loop:
do i = 1, base_dim
x = log( real(n + 1) / real(base(i)) )
npow = floor(x)
sk = n
do j = npow, 0, -1
bk = base(i)**j
nk = floor( real(sk) / real(bk) )
sk = sk - nk * bk
q(i) = q(i) + real(nk) / real(bk * base(i))
end do
end do
n, base and base_dim are function input. The loop is not vectorized due a dependency in a line sk = sk - nk*bk. I'm not sure that this is a prefix sum case, because each value depends on sk at the prevoius step. Do you have an idea?
Thanks in advance.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You're right, the inner loop would not be vectorizable. The current compiler suggests either switching the loops or applying simd directive on the outer loop, but the degree of mixed data types appears to be an impediment to auto-vectorization with short vector math library.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
At issue here is each q(i) potentially will have different iteration counts and thus not lend its self to vectorization as written.
The suggestion I have will require a little work, but my guess is your problem statement came about because you need to improve this routine. I will sketch what you can do, you decide if it is worth it to program.
Produce an array of npow's:
do i = 1, base_dim
x = log( real(n + 1) / real(base(i)) )
npow(i) = floor(x)
end do
Then for each unique npow within npow(:) create a gather/scatter index array. Using the scatter/gather index array, gather the base(i)'s into a new vector, say base_temp(:). I will assume the representative q(i)'s are initialized to 0, if so, initialize q_temp(:) to 0's, if not use scattery/gather index array to populate q_temp.
q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp))
Where n_temp is the number of matching npow's
At this point, base_temp(1:n_temp) and q_temp(1:n_temp) are same size, same npow'ed, compact vectors. Then using the current unique npow (which you saved in npow_selected):
do i=1,n_temp ! base_temp(1:n_temp) and q_temp(1:n_temp) sk = n do j = npow_selected, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp(i)**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp(i) = q_temp(i) + real(nk) / real(bk * base_temp(i)) end do end do
The above loop should vectorize. I will let you fill in the nitty-gritty of performing the gather and initializing the npow_selected.
After the above loop, the q_temp(1:n_temp) array is scattered via the gather/scatter index array back into q:
q(scattergather(1:n_temp)) = q_temp(1:n_temp)
You should see full vectorization of both loops.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Furthermore, the lines
nk = floor( real(sk) / real(bk) )
sk = sk - nk * bk
could be replaced by
sk = mod(sk, bk)
Correction added later: if args to mod() may be negative, use modulo(sk, bk) instead; see below.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
A better (untested) rewrite of my #4
do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) sk = n do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp(i)**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp(i) = q_temp(i) + real(nk) / real(bk * base_temp(i)) end do end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
jimdempseyatthecove wrote:
The above loop should vectorize. I will let you fill in the nitty-gritty of performing the gather and initializing the npow_selected.
After the above loop, the q_temp(1:n_temp) array is scattered via the gather/scatter index array back into q:
q(scattergather(1:n_temp)) = q_temp(1:n_temp)
You should see full vectorization of both loops.
Jim Dempsey
Hi Jim, thanks for your great reply! I tried your implementation, but the compiler tell on the inner loop "do j = npow, 0, -1" remark: loop was not vectorized: existence of vector dependence.
The dependance are in the lines:
nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk
I have not yet nested the results.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Try this:
do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) sk = n q_temp_delta = 0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp(i)**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp(i)) end do q_temp(i) = q_temp(i) + q_temp_delta end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
They are both equivalent. In both cases I do not see a vector dependency. The compiler may have been confused by something.
f this works, don't forget to try mecej4's sk= mod(sk,bk).
If the above reports vector dependency then try:
do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) base_temp_i = base_temp(i) sk = n q_temp_delta = 0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp_i**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i) end do q_temp(i) = q_temp(i) + q_temp_delta end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim,
by using the first attempt (with q_temp), the compiler reports dependencies and the inner loop is not vectorized. In fact, the execution time is the same. The results are a bit different compared to original first version I reported. I suppose because there are many floating point accumulation and your version accumulates in a different order.
By using also base_temp_i, the execution time still is the same and the dependancies remains. The results are the same of first attempt.
By using sk = mod(sk, bk), the compiler vectorize the inner loop, the execution time perform very well (2x) but the results are totally wrong.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
...but the results are totally wrong.
Is sk/bk going to be negative sometimes? If so, you may need to use the MODULO function instead of the MOD function. From the
Fortran documentation: "If a is of type real and p is not equal to zero, the value of MODULO(a,p) is a - FLOOR( a/ p) * p"
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
mecej4 wrote:
Quote:
...but the results are totally wrong.
Is sk/bk going to be negative sometimes? If so, you may need to use the MODULO function instead of the MOD function. From the
Fortran documentation: "If a is of type real and p is not equal to zero, the value of MODULO(a,p) is a - FLOOR( a/ p) * p"
I'm not sure, but also trying MODULO function the results are bad. The strange thing is I compared sk from the two original lines and from the MOD and MODULO function. They return the same values. I suspect the compiler optimization fails.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>By using also base_temp_i, the execution time still is the same and the dependancies remains. The results are the same of first attempt.
It is odd that it sees a dependency here, unless the intrinsic floor is non-vectorizable.
Can you make sk, bk and nk of type real, then remove the conversion functions and use
do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) base_temp_i = base_temp(i) sk = n q_temp_delta = 0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = aint(base_temp_i**j) nk = aint( sk / bk ) ! requires sk/bk is positive q_temp_delta = q_temp_delta + nk / (bk * base_temp_i) end do q_temp(i) = q_temp(i) + q_temp_delta end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
Watch out for negative numbers
Jim
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
additional comment about integers and floats doubles and vectorization.
It is important that the vectors, implicit as well as explicit, have the same number of lanes (variables within the lane). In the prior code, if your temporaries (bk, nk, sk) are integers, the default size is 4 bytes. If your array q (and q_temp) is REAL(8), then the compiler may have an issue differing vector widths between the implicit ones generated from your temporaries and the implicit ones generated from your defined variables.
It is unclear as to if this is a problem for your code, but it could if you use a compiler option to set the default real to REAL(8).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Unrue wrote:
The strange thing is I compared sk from the two original lines and from the MOD and MODULO function. They return the same values. I suspect the compiler optimization fails.
Let us get all the evidence before we announce the verdict. You have not shown more than a small fragment of code, and we do not know the gamut of values that the variables in that fragment may take. Here is a small example to show the difference between mod() and modulo(). Consider making up a small, self-contained example built around the code fragment that you showed. As you have observed, the code with the modulo() could give you significant speed-up because of vectorization.
program whichMod integer i,j,k1,k2 write(*,'(A,/)')' i N mod(N,15) modulo(N,15)' do i=-5,5,2 j=3*i-4 k1=mod(j,15) k2=modulo(j,15) write(*,'(4I10)')i,j,k1,k2 end do end program
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
jimdempseyatthecove wrote:
additional comment about integers and floats doubles and vectorization.
It is important that the vectors, implicit as well as explicit, have the same number of lanes (variables within the lane). In the prior code, if your temporaries (bk, nk, sk) are integers, the default size is 4 bytes. If your array q (and q_temp) is REAL(8), then the compiler may have an issue differing vector widths between the implicit ones generated from your temporaries and the implicit ones generated from your defined variables.
It is unclear as to if this is a problem for your code, but it could if you use a compiler option to set the default real to REAL(8).
Jim Dempsey
Hi Jim, thanks for your reply. I checked sk/bk and is never negative. Now all variables in the inner loop are real
I cannot substitute the two lines:
nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk
with sk = MOD(sk, bk) function, because nk is used in the next line:
do j = npow, 0, -1 ! npow the same for all in base_temp(:)
bk = base_tmp_i**j
nk = floor( sk / bk )
sk = sk - nk * bk
q_tmp_delta = q_tmp_delta + nk / (bk * base_tmp_i)
end do
So I need to compute nk. This is the reason of wrong results of MOD and MUDULO functions. If I substitute nk with (sk/nk) in the last expression, the code is not vectorized.
In your last suggest:
do j = npow, 0, -1 ! npow the same for all in base_temp(:)
bk = aint(base_temp_i**j)
nk = aint( sk / bk ) ! requires sk/bk is positive
q_temp_delta = q_temp_delta + nk / (bk * base_temp_i)
end do
sk is not updated, and adding the update (sk = sk - nk * bk) the code is not vectorized
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
mecej4 wrote:
Let us get all the evidence before we announce the verdict. You have not shown more than a small fragment of code, and we do not know the gamut of values that the variables in that fragment may take. Here is a small example to show the difference between mod() and modulo(). Consider making up a small, self-contained example built around the code fragment that you showed. As you have observed, the code with the modulo() could give you significant speed-up because of vectorization.
program whichMod integer i,j,k1,k2 write(*,'(A,/)')' i N mod(N,15) modulo(N,15)' do i=-5,5,2 j=3*i-4 k1=mod(j,15) k2=modulo(j,15) write(*,'(4I10)')i,j,k1,k2 end do end program
Hi mecej4. I understand why MOD and MODULO functions gave me wrong results. Simply because variable nk was not calculated. See my post above.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>sk is not updated, and adding the update (sk = sk - nk * bk) the code is not vectorized
?? this is an iterative process, same number of iterations on each lane of the vector, each lane computing its own unique value. It should vectorize.
Lets help it along...
Try adding !DIR$ SIMD PRIVATE(sk, bk, nk, base_temp_i, q_temp_delta) to the do i = 1,n_temp loop
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
jimdempseyatthecove wrote:
>>sk is not updated, and adding the update (sk = sk - nk * bk) the code is not vectorized
?? this is an iterative process, same number of iterations on each lane of the vector, each lane computing its own unique value. It should vectorize.
Lets help it along...
Try adding !DIR$ SIMD PRIVATE(sk, bk, nk, base_temp_i, q_temp_delta) to the do i = 1,n_temp loop
Jim Dempsey
Hi Jim,
By using yout suggest, the loop is now vectorized but the results are wrong. I tried:
!DIR$ SIMD PRIVATE(sk, bk, nk, base_tmp_i, q_tmp_delta)
do j = npow, 0, -1 ! npow the same for all in base_temp(:)
bk = aint(base_tmp_i**j)
nk = aint( sk / bk )
sk = sk - nk * bk
q_tmp_delta = q_tmp_delta + nk / (bk * base_tmp_i)
end do
And:
!DIR$ SIMD PRIVATE(sk, bk, nk, base_tmp_i, q_tmp_delta)
do j = npow, 0, -1 ! npow the same for all in base_temp(:)
bk = base_tmp_i**j
nk = floor( sk / bk )
sk = sk - nk * bk
q_tmp_delta = q_tmp_delta + nk / (bk * base_tmp_i)
end do
Disabling DIR SIMD, code is not vectorized and results are good. It seems the vectorization causes wrong results.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Let's take another attempt at skinning the cat.
real function q_temp_delta(n, npow, base_temp_i) !dir$ addributes vector :: q_temp_delta implicit none integer, intent(in) :: n,npow real, intent(in) :: base_temp_i ! local variables integer :: sk, bk, nk sk = n q_temp_delta = 0.0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp_i**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i) end do end function q_temp_delta ... do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) q_temp(i) = q_temp(i) + q_temp_delta(n, npow, base_temp(i)) end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
remove now unnecessary PRIVATE(...)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim,
many thanks for your great help, but the code still not working well (results are bad), and your last versione is not vectorized. The compiler still tell:
remark: loop was not vectorized: existence of vector dependence. module_Quasirandom.f90(123): (col. 7) remark: vector dependence: assumed ANTI dependence between sk line 123 and sk line 123. module_Quasirandom.f90(123): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 123 and sk line 123. module_Quasirandom.f90(123): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 123 and sk line 122. module_Quasirandom.f90(122): (col. 12) remark: vector dependence: assumed ANTI dependence between sk line 122 and sk line 123.
The involved lines are:
nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk
Could it depends because I'm using OpenMP with 1 thread? The entire routine is under a parallel region, and some vector (like base, base_tmp, q_tmp, scatter_gather, npows, base_dim) are THREADPRIVATE.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Do not use the any thread private variables inside the function I provided.
consider adding PURE attribute to function name and adding UNIFORM clause to !DIR$
PURE real function q_temp_delta(n, npow, base_temp_i) !dir$ addributes vector:UNIFORM(n, npow) :: q_temp_delta implicit none integer, intent(in) :: n,npow real, intent(in) :: base_temp_i ! local variables integer :: sk, bk, nk sk = n q_temp_delta = 0.0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp_i**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i) end do end function q_temp_delta
Also, if floor continues to be an issue, replace it with your own floor equivalent code that does not have branch flow statements.
real_sk_over_bk = real(sk) / real(bk)
if(mod(real_sk_over_bk, 1.0) .ne. 0.0) real_sk_over_bk = real_sk_over_bk - 1.0
nk = real_sk_over_bk
Add the real variable real_sk_over_bk (change the name if you want), but do not rewrite the code.
The above code can generate and use the vector test and mask move instructions
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Now, I'm not using any thread variable. This is the compiler report (a part):
module_Quasirandom.f90(111): (col. 22) remark: FUNCTION WAS VECTORIZED.
module_Quasirandom.f90(111): (col. 22) remark: FUNCTION WAS VECTORIZED.
module_Quasirandom.f90(121): (col. 5) remark: loop was not vectorized: existence of vector dependence.
module_Quasirandom.f90(127): (col. 7) remark: vector dependence: assumed ANTI dependence between sk line 127 and sk line 127.
module_Quasirandom.f90(127): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 127 and sk line 127.
module_Quasirandom.f90(127): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 127 and sk line 124.
module_Quasirandom.f90(124): (col. 7) remark: vector dependence: assumed ANTI dependence between sk line 124 and sk line 127.
And results are still wrong. At line 111, there is q_tmp_delta function. At this point, I suspect a bug in my code in another section.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page