Is there any way to speed up the following code?
[fortran]
do i=1,1000000
  s0=i*10+1
  s1=i*10+10
  do j=s0,s1
    val(i) = val(i) + a(j)*c(j)
  enddo
enddo
[/fortran]
6 Replies
It may help to replace the inner loop by
!dir$ unroll(0)
val(i)=val(i)+dot_product(a(s0:s1),c(s0:s1))
ifort should optimize this with default options. -assume protect_parens will not prevent this optimization, but -fp-model source and related options will inhibit "vectorization."
An alternative to the portable dot_product is the Intel-specific !dir$ simd reduction.
As your loop length is only 10, you don't get the full benefit of compiler vectorization when you can't specify data alignment. The compiler may report "seems inefficient" when it sees a loop length of 10 with unknown alignment.
Other traditional possibilities include writing it out in various non-sequential forms:
val(i)=val(i)+dot_product(a(s0:s0+1),c(s0:s0+1))+dot_product(a(s0+2:s0+9),c(s0+2:s0+9)) .....
This may also depend strongly on alignment if any vectorization is involved.
Ideally, the compiler would not introduce alignment dependence when you use an option such as -xHost, which should take advantage of current instruction sets.
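For reference, a minimal sketch of the reduction idea, accumulating into a scalar so the compiler sees a plain sum. This swaps in the standard OpenMP `!$omp simd reduction` directive rather than the Intel-specific `!dir$ simd` form; the module and subroutine names are made up for illustration, and single-precision `real` arrays are an assumption, since the thread does not show the declarations.

```fortran
module simd_sum_sketch
  implicit none
contains
  ! Adds the dot product of each 10-element block of a and c to val(i).
  ! Block i starts at element i*10+1, as in the original loop, so a and c
  ! must hold at least 10*n+10 elements (assumption: lower bound 1).
  subroutine block_dot_simd(n, a, c, val)
    integer, intent(in)    :: n
    real,    intent(in)    :: a(:), c(:)
    real,    intent(inout) :: val(:)
    integer :: i, j, s0
    real    :: t

    do i = 1, n
      s0 = i*10 + 1
      t = 0.0
      ! Without an OpenMP SIMD option this line is just a comment.
      !$omp simd reduction(+:t)
      do j = s0, s0 + 9
        t = t + a(j)*c(j)
      end do
      val(i) = val(i) + t
    end do
  end subroutine block_dot_simd
end module simd_sum_sketch
```

The directive only takes effect when OpenMP SIMD support is enabled (for example -qopenmp-simd with ifort or -fopenmp-simd with gfortran). With a trip count of 10, any gain is not guaranteed and should be measured.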
[#]$ ifort loop.f90
loop.f90(7): warning #7866: The statement following this DEC loop optimization directive must be an iterative do-stmt, a vector assignment, an OMP pdo-directive, or an OMP parallel-do-directive.
!dir$ unroll(0)
------^
Did I make a mistake somewhere?
Quoting GHui
Is there any way to speed up the following code?...
...
do i = 1, 1000000
  s0 = i * 10 + 1
  s1 = s0 + 9
  do j = s0, s1
    val(i) = val(i) + a(j) * c(j)
  enddo
enddo
...
Sorry, apparently you will need to omit the unroll directive when trying dot_product.
I tried dot_product in several cases. Performance did not improve.
do i=1,1000000
  s0=i*10+1
  s1=i*10+10
  do j=s0,s1
Shouldn't your i loop be do i=0,1000000?
(or i loop control reduced by /10?)
Also, are val(:), a(:) and c(:) dimensioned at 1000000*10+10?
IOW 10x the size of the i loop iteration plus 10
Try creating a temp array of the proper size (the same size as val(:), a(:) and c(:)):
! the following will vectorize quite well
! do the product part of the dot products
do i=1,1000000*10+10
  temp(i) = a(i) * c(i)
end do
! now do the sum part of the small dot products
do i=1,1000000
  s0=i*10+1
  s1=i*10+10
  t = 0.0
  do j=s0,s1
    t = t + temp(j)
  end do
  val(i) = val(i) + t
end do
Start with that
Jim Dempsey
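The two-pass suggestion above can be sketched as a self-contained routine. This is only an illustration under assumptions not stated in the thread: the module and subroutine names are hypothetical, the arrays are taken to be single-precision `real` with lower bound 1, and `temp` is allocated inside the routine for simplicity.

```fortran
module two_pass_sketch
  implicit none
contains
  ! Two-pass form of the blocked dot product:
  ! pass 1 is one long unit-stride multiply, pass 2 sums 10 products per i.
  ! a and c must hold at least 10*n+10 elements, because block i covers
  ! elements i*10+1 .. i*10+10 (elements 1..10 are never summed for i >= 1).
  subroutine block_dot_two_pass(n, a, c, val)
    integer, intent(in)    :: n
    real,    intent(in)    :: a(:), c(:)
    real,    intent(inout) :: val(:)
    real, allocatable :: temp(:)
    integer :: i, j, s0
    real    :: t

    allocate(temp(10*n + 10))

    ! Pass 1: elementwise products over the whole range.
    do i = 1, 10*n + 10
      temp(i) = a(i)*c(i)
    end do

    ! Pass 2: sum each block of 10 products into a scalar, then update val.
    do i = 1, n
      s0 = i*10 + 1
      t = 0.0
      do j = s0, s0 + 9
        t = t + temp(j)
      end do
      val(i) = val(i) + t
    end do
  end subroutine block_dot_two_pass
end module two_pass_sketch
```

Calling `block_dot_two_pass(1000000, a, c, val)` corresponds to the loop bounds in the question. The extra array adds memory traffic, so whether this beats the single loop depends on the machine and is worth timing both ways.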
