- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Dear Intel users, I'm using ifort 13.1.3 under Linux.
I would like to vectorize the inner loop but I'm not sure that the inner loop is a good candidate. This is the loop:
do i = 1, base_dim
x = log( real(n + 1) / real(base(i)) )
npow = floor(x)
sk = n
do j = npow, 0, -1
bk = base(i)**j
nk = floor( real(sk) / real(bk) )
sk = sk - nk * bk
q(i) = q(i) + real(nk) / real(bk * base(i))
end do
end do
n, base and base_dim are function input. The loop is not vectorized due a dependency in a line sk = sk - nk*bk. I'm not sure that this is a prefix sum case, because each value depends on sk at the prevoius step. Do you have an idea?
Thanks in advance.
コピーされたリンク
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
You're right, the inner loop would not be vectorizable. The current compiler suggests either switching the loops or applying simd directive on the outer loop, but the degree of mixed data types appears to be an impediment to auto-vectorization with short vector math library.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
At issue here is each q(i) potentially will have different iteration counts and thus not lend its self to vectorization as written.
The suggestion I have will require a little work, but my guess is your problem statement came about because you need to improve this routine. I will sketch what you can do, you decide if it is worth it to program.
Produce an array of npow's:
do i = 1, base_dim
x = log( real(n + 1) / real(base(i)) )
npow(i) = floor(x)
end do
Then for each unique npow within npow(:) create a gather/scatter index array. Using the scatter/gather index array, gather the base(i)'s into a new vector, say base_temp(:). I will assume the representative q(i)'s are initialized to 0, if so, initialize q_temp(:) to 0's, if not use scattery/gather index array to populate q_temp.
q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp))
Where n_temp is the number of matching npow's
At this point, base_temp(1:n_temp) and q_temp(1:n_temp) are same size, same npow'ed, compact vectors. Then using the current unique npow (which you saved in npow_selected):
do i=1,n_temp ! base_temp(1:n_temp) and q_temp(1:n_temp) sk = n do j = npow_selected, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp(i)**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp(i) = q_temp(i) + real(nk) / real(bk * base_temp(i)) end do end do
The above loop should vectorize. I will let you fill in the nitty-gritty of performing the gather and initializing the npow_selected.
After the above loop, the q_temp(1:n_temp) array is scattered via the gather/scatter index array back into q:
q(scattergather(1:n_temp)) = q_temp(1:n_temp)
You should see full vectorization of both loops.
Jim Dempsey
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Furthermore, the lines
nk = floor( real(sk) / real(bk) )
sk = sk - nk * bk
could be replaced by
sk = mod(sk, bk)
Correction added later: if args to mod() may be negative, use modulo(sk, bk) instead; see below.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
A better (untested) rewrite of my #4
do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) sk = n do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp(i)**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp(i) = q_temp(i) + real(nk) / real(bk * base_temp(i)) end do end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
Jim Dempsey
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
jimdempseyatthecove wrote:
The above loop should vectorize. I will let you fill in the nitty-gritty of performing the gather and initializing the npow_selected.
After the above loop, the q_temp(1:n_temp) array is scattered via the gather/scatter index array back into q:
q(scattergather(1:n_temp)) = q_temp(1:n_temp)
You should see full vectorization of both loops.
Jim Dempsey
Hi Jim, thanks for your great reply! I tried your implementation, but the compiler tell on the inner loop "do j = npow, 0, -1" remark: loop was not vectorized: existence of vector dependence.
The dependance are in the lines:
nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk
I have not yet nested the results.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Try this:
do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) sk = n q_temp_delta = 0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp(i)**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp(i)) end do q_temp(i) = q_temp(i) + q_temp_delta end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
They are both equivalent. In both cases I do not see a vector dependency. The compiler may have been confused by something.
f this works, don't forget to try mecej4's sk= mod(sk,bk).
If the above reports vector dependency then try:
do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) base_temp_i = base_temp(i) sk = n q_temp_delta = 0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp_i**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i) end do q_temp(i) = q_temp(i) + q_temp_delta end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
Jim Dempsey
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Hi Jim,
by using the first attempt (with q_temp), the compiler reports dependencies and the inner loop is not vectorized. In fact, the execution time is the same. The results are a bit different compared to original first version I reported. I suppose because there are many floating point accumulation and your version accumulates in a different order.
By using also base_temp_i, the execution time still is the same and the dependancies remains. The results are the same of first attempt.
By using sk = mod(sk, bk), the compiler vectorize the inner loop, the execution time perform very well (2x) but the results are totally wrong.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
...but the results are totally wrong.
Is sk/bk going to be negative sometimes? If so, you may need to use the MODULO function instead of the MOD function. From the
Fortran documentation: "If a is of type real and p is not equal to zero, the value of MODULO(a,p) is a - FLOOR( a/ p) * p"
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
mecej4 wrote:
Quote:
...but the results are totally wrong.
Is sk/bk going to be negative sometimes? If so, you may need to use the MODULO function instead of the MOD function. From the
Fortran documentation: "If a is of type real and p is not equal to zero, the value of MODULO(a,p) is a - FLOOR( a/ p) * p"
I'm not sure, but also trying MODULO function the results are bad. The strange thing is I compared sk from the two original lines and from the MOD and MODULO function. They return the same values. I suspect the compiler optimization fails.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
>>By using also base_temp_i, the execution time still is the same and the dependancies remains. The results are the same of first attempt.
It is odd that it sees a dependency here, unless the intrinsic floor is non-vectorizable.
Can you make sk, bk and nk of type real, then remove the conversion functions and use
do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) base_temp_i = base_temp(i) sk = n q_temp_delta = 0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = aint(base_temp_i**j) nk = aint( sk / bk ) ! requires sk/bk is positive q_temp_delta = q_temp_delta + nk / (bk * base_temp_i) end do q_temp(i) = q_temp(i) + q_temp_delta end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
Watch out for negative numbers
Jim
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
additional comment about integers and floats doubles and vectorization.
It is important that the vectors, implicit as well as explicit, have the same number of lanes (variables within the lane). In the prior code, if your temporaries (bk, nk, sk) are integers, the default size is 4 bytes. If your array q (and q_temp) is REAL(8), then the compiler may have an issue differing vector widths between the implicit ones generated from your temporaries and the implicit ones generated from your defined variables.
It is unclear as to if this is a problem for your code, but it could if you use a compiler option to set the default real to REAL(8).
Jim Dempsey
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Unrue wrote:
The strange thing is I compared sk from the two original lines and from the MOD and MODULO function. They return the same values. I suspect the compiler optimization fails.
Let us get all the evidence before we announce the verdict. You have not shown more than a small fragment of code, and we do not know the gamut of values that the variables in that fragment may take. Here is a small example to show the difference between mod() and modulo(). Consider making up a small, self-contained example built around the code fragment that you showed. As you have observed, the code with the modulo() could give you significant speed-up because of vectorization.
program whichMod integer i,j,k1,k2 write(*,'(A,/)')' i N mod(N,15) modulo(N,15)' do i=-5,5,2 j=3*i-4 k1=mod(j,15) k2=modulo(j,15) write(*,'(4I10)')i,j,k1,k2 end do end program
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
jimdempseyatthecove wrote:
additional comment about integers and floats doubles and vectorization.
It is important that the vectors, implicit as well as explicit, have the same number of lanes (variables within the lane). In the prior code, if your temporaries (bk, nk, sk) are integers, the default size is 4 bytes. If your array q (and q_temp) is REAL(8), then the compiler may have an issue differing vector widths between the implicit ones generated from your temporaries and the implicit ones generated from your defined variables.
It is unclear as to if this is a problem for your code, but it could if you use a compiler option to set the default real to REAL(8).
Jim Dempsey
Hi Jim, thanks for your reply. I checked sk/bk and is never negative. Now all variables in the inner loop are real
I cannot substitute the two lines:
nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk
with sk = MOD(sk, bk) function, because nk is used in the next line:
do j = npow, 0, -1 ! npow the same for all in base_temp(:)
bk = base_tmp_i**j
nk = floor( sk / bk )
sk = sk - nk * bk
q_tmp_delta = q_tmp_delta + nk / (bk * base_tmp_i)
end do
So I need to compute nk. This is the reason of wrong results of MOD and MUDULO functions. If I substitute nk with (sk/nk) in the last expression, the code is not vectorized.
In your last suggest:
do j = npow, 0, -1 ! npow the same for all in base_temp(:)
bk = aint(base_temp_i**j)
nk = aint( sk / bk ) ! requires sk/bk is positive
q_temp_delta = q_temp_delta + nk / (bk * base_temp_i)
end do
sk is not updated, and adding the update (sk = sk - nk * bk) the code is not vectorized
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
mecej4 wrote:
Let us get all the evidence before we announce the verdict. You have not shown more than a small fragment of code, and we do not know the gamut of values that the variables in that fragment may take. Here is a small example to show the difference between mod() and modulo(). Consider making up a small, self-contained example built around the code fragment that you showed. As you have observed, the code with the modulo() could give you significant speed-up because of vectorization.
program whichMod integer i,j,k1,k2 write(*,'(A,/)')' i N mod(N,15) modulo(N,15)' do i=-5,5,2 j=3*i-4 k1=mod(j,15) k2=modulo(j,15) write(*,'(4I10)')i,j,k1,k2 end do end program
Hi mecej4. I understand why MOD and MODULO functions gave me wrong results. Simply because variable nk was not calculated. See my post above.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
>>sk is not updated, and adding the update (sk = sk - nk * bk) the code is not vectorized
?? this is an iterative process, same number of iterations on each lane of the vector, each lane computing its own unique value. It should vectorize.
Lets help it along...
Try adding !DIR$ SIMD PRIVATE(sk, bk, nk, base_temp_i, q_temp_delta) to the do i = 1,n_temp loop
Jim Dempsey
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
jimdempseyatthecove wrote:
>>sk is not updated, and adding the update (sk = sk - nk * bk) the code is not vectorized
?? this is an iterative process, same number of iterations on each lane of the vector, each lane computing its own unique value. It should vectorize.
Lets help it along...
Try adding !DIR$ SIMD PRIVATE(sk, bk, nk, base_temp_i, q_temp_delta) to the do i = 1,n_temp loop
Jim Dempsey
Hi Jim,
By using yout suggest, the loop is now vectorized but the results are wrong. I tried:
!DIR$ SIMD PRIVATE(sk, bk, nk, base_tmp_i, q_tmp_delta)
do j = npow, 0, -1 ! npow the same for all in base_temp(:)
bk = aint(base_tmp_i**j)
nk = aint( sk / bk )
sk = sk - nk * bk
q_tmp_delta = q_tmp_delta + nk / (bk * base_tmp_i)
end do
And:
!DIR$ SIMD PRIVATE(sk, bk, nk, base_tmp_i, q_tmp_delta)
do j = npow, 0, -1 ! npow the same for all in base_temp(:)
bk = base_tmp_i**j
nk = floor( sk / bk )
sk = sk - nk * bk
q_tmp_delta = q_tmp_delta + nk / (bk * base_tmp_i)
end do
Disabling DIR SIMD, code is not vectorized and results are good. It seems the vectorization causes wrong results.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Let's take another attempt at skinning the cat.
real function q_temp_delta(n, npow, base_temp_i) !dir$ addributes vector :: q_temp_delta implicit none integer, intent(in) :: n,npow real, intent(in) :: base_temp_i ! local variables integer :: sk, bk, nk sk = n q_temp_delta = 0.0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp_i**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i) end do end function q_temp_delta ... do i = 1, base_dim x = log( real(n + 1) / real(base(i)) ) npows(i) = floor(x) end do do npow = minval(npows), maxval(npows) n_temp = 0 do i = 1, base_dim if(npows(i) == npow) then n_temp = n_temp + 1 scattergather(n_temp) = i end if end do if(n_temp > 0) then base_temp(1:n_temp) = base(scattergather(1:n_temp)) q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp)) do i=1,n_temp ! width of base_temp(:) and q_temp(:) q_temp(i) = q_temp(i) + q_temp_delta(n, npow, base_temp(i)) end do q(scattergather(1:n_temp)) = q_temp(1:n_temp) end if end do
remove now unnecessary PRIVATE(...)
Jim Dempsey
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Hi Jim,
many thanks for your great help, but the code still not working well (results are bad), and your last versione is not vectorized. The compiler still tell:
remark: loop was not vectorized: existence of vector dependence. module_Quasirandom.f90(123): (col. 7) remark: vector dependence: assumed ANTI dependence between sk line 123 and sk line 123. module_Quasirandom.f90(123): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 123 and sk line 123. module_Quasirandom.f90(123): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 123 and sk line 122. module_Quasirandom.f90(122): (col. 12) remark: vector dependence: assumed ANTI dependence between sk line 122 and sk line 123.
The involved lines are:
nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk
Could it depends because I'm using OpenMP with 1 thread? The entire routine is under a parallel region, and some vector (like base, base_tmp, q_tmp, scatter_gather, npows, base_dim) are THREADPRIVATE.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Do not use the any thread private variables inside the function I provided.
consider adding PURE attribute to function name and adding UNIFORM clause to !DIR$
PURE real function q_temp_delta(n, npow, base_temp_i) !dir$ addributes vector:UNIFORM(n, npow) :: q_temp_delta implicit none integer, intent(in) :: n,npow real, intent(in) :: base_temp_i ! local variables integer :: sk, bk, nk sk = n q_temp_delta = 0.0 do j = npow, 0, -1 ! npow the same for all in base_temp(:) bk = base_temp_i**j nk = floor( real(sk) / real(bk) ) sk = sk - nk * bk q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i) end do end function q_temp_delta
Also, if floor continues to be an issue, replace it with your own floor equivalent code that does not have branch flow statements.
real_sk_over_bk = real(sk) / real(bk)
if(mod(real_sk_over_bk, 1.0) .ne. 0.0) real_sk_over_bk = real_sk_over_bk - 1.0
nk = real_sk_over_bk
Add the real variable real_sk_over_bk (change the name if you want), but do not rewrite the code.
The above code can generate and use the vector test and mask move instructions
Jim Dempsey
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Now, I'm not using any thread variable. This is the compiler report (a part):
module_Quasirandom.f90(111): (col. 22) remark: FUNCTION WAS VECTORIZED.
module_Quasirandom.f90(111): (col. 22) remark: FUNCTION WAS VECTORIZED.
module_Quasirandom.f90(121): (col. 5) remark: loop was not vectorized: existence of vector dependence.
module_Quasirandom.f90(127): (col. 7) remark: vector dependence: assumed ANTI dependence between sk line 127 and sk line 127.
module_Quasirandom.f90(127): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 127 and sk line 127.
module_Quasirandom.f90(127): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 127 and sk line 124.
module_Quasirandom.f90(124): (col. 7) remark: vector dependence: assumed ANTI dependence between sk line 124 and sk line 127.
And results are still wrong. At line 111, there is q_tmp_delta function. At this point, I suspect a bug in my code in another section.