Hi,
I am using ifort to perform scientific calculations.
According to Valgrind, the most CPU-costly subroutine is the derivative routine, which performs a simple 2D convolution:
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(i,j,1) * F(i,j) + A(i,j,2) * F(i-1,j) + A(i,j,3) * F(i+1,j) + A(i,j,4) * F(i,j-1) + A(i,j,5) * F(i,j+1)
  end do
end do
I am trying to use the BLAS library to accelerate this calculation, but I have not found an appropriate way.
I think the best way would be to unroll the loop and then use BLAS level-2 vector/matrix operations, but I am not sure about that.
Does someone have an idea on how to optimize this calculation?
This doesn't look like a BLAS operation nor what is usually described as convolution. Unless you have hidden something from us, it does look as if it should easily auto-vectorize, and (assuming Nx2 is large enough) should easily benefit from parallelization of the outer loop (by OpenMP or -parallel).
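To make that concrete, here is a minimal OpenMP sketch of the outer-loop parallelization (assuming only that DF, F and A are shaped as in the post above; the j iterations are independent, since each writes a distinct column of DF):
[fortran]
! Each thread handles a range of j values; no synchronization is
! needed inside the nest because the writes to DF do not overlap.
!$omp parallel do private(i)
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(i,j,1) * F(i,j)   + A(i,j,2) * F(i-1,j) &
            + A(i,j,3) * F(i+1,j) + A(i,j,4) * F(i,j-1) &
            + A(i,j,5) * F(i,j+1)
  end do
end do
!$omp end parallel do
[/fortran]
Compile with -openmp (ifort of this vintage) or -fopenmp (gfortran).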
Hi,
I thought that this was what is called a convolution. I do apologize.
In fact, my code already uses ifort with MPI to perform huge calculations on 128 processors, but even with many cores the calculation of that loop is CPU-costly, because the array F is not read ideally: to compute DF(i,j) I read F(i,j-1), F(i,j+1), F(i,j), F(i+1,j) and F(i-1,j), and these values are not all stored at nearby places in memory, so there are discontinuities.
I have read somewhere that BLAS can optimize this type of calculation. Do you think I am looking at the wrong library?
The right-hand side of the assignment reminds one of the discretized Laplacian in 2D.
As to improving the calculation: what can you tell us about the array A?
Do the elements of A change with i,j?
Are most of the elements A(:,:,1) the same in value? Similarly for A(:,:,2) and A(:,:,3)?
I suggest not unrolling the loop.
Instead, "shear" the loop:
[fortran]
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(i,j,1) * F(i,j)
  end do
end do
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,2) * F(i-1,j)
  end do
end do
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,3) * F(i+1,j)
  end do
end do
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,4) * F(i,j-1)
  end do
end do
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,5) * F(i,j+1)
  end do
end do
[/fortran]
Try the above and please report the effect on performance.
Jim Dempsey
Also try shearing the inner loop:
[fortran]
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(i,j,1) * F(i,j)
  end do
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,2) * F(i-1,j)
  end do
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,3) * F(i+1,j)
  end do
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,4) * F(i,j-1)
  end do
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,5) * F(i,j+1)
  end do
end do
[/fortran]
Jim Dempsey
Either approach will reduce the pressure on the TLBs (Translation Look-aside Buffers): each sheared loop streams through fewer distinct arrays at a time, so fewer page translations have to stay resident simultaneously.
Jim Dempsey
As to improving the calculation: what can you tell us about the array A?
In fact, I have two routines that use this type of calculation:
- the Laplacian discretization, as you said, inside a preconditioned BiCGSTAB used to solve the Poisson equation;
- fourth-order finite differences (FD4) for derivatives (with a larger stencil: from F(i-2) to F(i+2)).
In the Laplacian, the coefficients of A are not all the same, due to immersed bodies in the CFD simulation; that is why I use BiCGSTAB instead of a simple PCG. (A(i-1,j,1) /= A(i,j,1) /= A(i,j,2), etc.)
In FD4, coefficients with the same last index are all the same (A(i,j,1) = A(i-1,j,1) = A(i,j+1,1), but A(i,j,1) /= A(i,j,2)).
So you can consider:
Poisson:
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(i,j,1) * F(i,j) + A(i,j,2) * F(i-1,j) + A(i,j,3) * F(i+1,j) + A(i,j,4) * F(i,j-1) + A(i,j,5) * F(i,j+1)
  end do
end do
FD2 (FD4 normally, but FD2 for this example):
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(1) * F(i,j) + A(2) * F(i-1,j) + A(3) * F(i+1,j) + A(4) * F(i,j-1) + A(5) * F(i,j+1)
  end do
end do
Try the above and please report the effect on performance.
I will do it now, and I will report the effects with -O4.
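For the constant-coefficient (FD2/FD4) case, a whole-array formulation is another candidate worth timing; a minimal sketch, assuming F carries a one-cell halo, i.e. is dimensioned F(0:Nx1+1,0:Nx2+1) as in the test program later in this thread:
[fortran]
! The same FD2 stencil written with array sections: shifting the section
! bounds by +/-1 expresses F(i-1,j), F(i+1,j), F(i,j-1) and F(i,j+1).
DF(1:Nx1,1:Nx2) = A(1) * F(1:Nx1  ,1:Nx2  ) &
                + A(2) * F(0:Nx1-1,1:Nx2  ) &
                + A(3) * F(2:Nx1+1,1:Nx2  ) &
                + A(4) * F(1:Nx1  ,0:Nx2-1) &
                + A(5) * F(1:Nx1  ,2:Nx2+1)
[/fortran]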
I tried with two compilers, and the results are not the same.
My processor is an Intel Xeon X5570 @ 2.93 GHz.
The code is the following:
[fortran]
program perfs
  implicit none
  integer, parameter :: Nx1=512, Nx2=512
  real(8), allocatable, dimension(:,:)   :: F, DF
  real(8), allocatable, dimension(:,:,:) :: A
  integer :: i, j, n
  real(8) :: time1, time2, totaltime

  allocate(F(0:Nx1+1,0:Nx2+1), DF(1:Nx1,1:Nx2))
  allocate(A(1:Nx1,1:Nx2,1:5))
  F(:,:) = dacos(1.0d0)
  ! suppose the A coefficients are constant in space
  A(:,:,1) = 1.0d0
  A(:,:,2) = 2.0d0
  A(:,:,3) = 3.0d0
  A(:,:,4) = 4.0d0
  A(:,:,5) = 5.0d0

  ! Test 1: original fused stencil loop
  call cpu_time(time1)
  do n=1,1000
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = A(i,j,1) * F(i,j)   + A(i,j,2) * F(i-1,j) &
                + A(i,j,3) * F(i+1,j) + A(i,j,4) * F(i,j-1) &
                + A(i,j,5) * F(i,j+1)
      end do
    end do
  end do
  call cpu_time(time2)
  totaltime = time2 - time1
  print *,"Test1 :",totaltime

  ! Test 2: outer shear (five full j/i loop nests)
  call cpu_time(time1)
  do n=1,1000
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = A(i,j,1) * F(i,j)
      end do
    end do
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,2) * F(i-1,j)
      end do
    end do
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,3) * F(i+1,j)
      end do
    end do
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,4) * F(i,j-1)
      end do
    end do
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,5) * F(i,j+1)
      end do
    end do
  end do
  call cpu_time(time2)
  totaltime = time2 - time1
  print *,"Test2 :",totaltime

  ! Test 3: inner shear (five i loops inside one j loop)
  call cpu_time(time1)
  do n=1,1000
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = A(i,j,1) * F(i,j)
      end do
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,2) * F(i-1,j)
      end do
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,3) * F(i+1,j)
      end do
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,4) * F(i,j-1)
      end do
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,5) * F(i,j+1)
      end do
    end do
  end do
  call cpu_time(time2)
  totaltime = time2 - time1
  print *,"Test3 :",totaltime

  deallocate(F, DF, A)
end program perfs
[/fortran]
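One caveat on the measurement itself: cpu_time reports processor time, which sums over threads once a code is parallelized, so for before/after comparisons of parallel variants a wall-clock timer such as the standard system_clock intrinsic is safer. A minimal sketch:
[fortran]
! Wall-clock timing with system_clock (an alternative to cpu_time).
integer(8) :: t1, t2, rate
call system_clock(t1, rate)
! ... timed region ...
call system_clock(t2)
print *, "elapsed seconds:", real(t2-t1,8) / real(rate,8)
[/fortran]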
With ifort (non-MPI version):
bl@galilee:~/Desktop$ ifort -O4 perfs.f90
bl@galilee:~/Desktop$ ./a.out
Test1 : 1.32000000000000
Test2 : 1.35000000000000
Test3 : 0.910000000000000
bl@galilee:~/Desktop$ ./a.out
Test1 : 1.33000000000000
Test2 : 1.34000000000000
Test3 : 0.910000000000000
bl@galilee:~/Desktop$
With gfortran:
bl@galilee:~/Desktop$ gfortran -O4 perfs.f90
bl@galilee:~/Desktop$ ./a.out
Test1 : 1.3799999999999999
Test2 : 1.8300000000000001
Test3 : 1.5599999999999996
bl@galilee:~/Desktop$ ./a.out
Test1 : 1.3699999999999999
Test2 : 1.8400000000000003
Test3 : 1.5600000000000001
bl@galilee:~/Desktop$
The third variant has clearly improved performance with ifort. However, I need the code to be the same on all platforms, including the cases where other compilers are used... (i.e. the big remote supercomputer that I sometimes use, which doesn't run ifort).
I will keep this trick in mind and make a special ifort routine for the local cluster.
Any other ideas to improve this loop?
A simple remark: the last test was greatly improved by a compilation option:
ifort -O4 perfs.f90
Test3 : 0.920000000000000
ifort -O4 -xSSE4.2 perfs.f90
Test3 : 0.530000000000000
It is now 2.5 times faster than the original routine. However, the option has no impact on the first two tests.
RE: SSE4.2
The innermost loops of test 2 and of test 3 are effectively the same; the difference in behavior is that test 3 has a smaller hot-in-cache footprint.
For gfortran, try an SSE level below 4.2 that your compiler supports. I suspect gfortran is not vectorizing the inner loops, as there appears to be no time difference.
Jim
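One way to check that suspicion: gfortran can report its vectorization decisions. The flag below is the GCC 4.4-era spelling (an assumption from memory; later releases replaced it with -fopt-info, so verify against your manual):
gfortran -O3 -msse4.2 -ftree-vectorizer-verbose=2 perfs.f90
The report states, loop by loop, whether vectorization succeeded and, if not, why.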
It has no effect.
My gfortran is:
$ gfortran --version
GNU Fortran (Ubuntu 4.4.3-4ubuntu5) 4.4.3
I tried the following options, all together and separately (my processors are Nehalem Xeons):
gfortran -O4 -mtune=barcelona -mfpmath=sse -msse4.2 -mmmx perfs.f90
Is there a way to rewrite the loop in a vectorized shape, to make gfortran run a little faster?
Or maybe I should compute values in blocks of data, so that the working set uses less than 90% of the L3 cache? If the array is 512*512, I would for example compute blocks of 128*128 so that all the data used resides in cache (after checking beforehand that 128*128*numberofarrays stays below 90% of L3). Could that improve performance?
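A minimal sketch of that blocking idea (BS is an assumed, tunable block size, not a measured optimum):
[fortran]
! Process the grid in BS x BS tiles so the portion of F being reused
! stays resident in cache. Declarations shown for completeness.
integer, parameter :: BS = 128
integer :: ib, jb
do jb = 1, Nx2, BS
  do ib = 1, Nx1, BS
    do j = jb, min(jb+BS-1, Nx2)
      do i = ib, min(ib+BS-1, Nx1)
        DF(i,j) = A(i,j,1) * F(i,j)   + A(i,j,2) * F(i-1,j) &
                + A(i,j,3) * F(i+1,j) + A(i,j,4) * F(i,j-1) &
                + A(i,j,5) * F(i,j+1)
      end do
    end do
  end do
end do
[/fortran]
Blocking pays off mainly when the same data is traversed several times, as in the sheared variants above; in a single fused sweep each element of A is touched once and neighbouring F values are already reused, so the gain may be small. For 512*512 real(8) arrays, F and DF are about 2 MB each and A about 10 MB, so the full working set (~14 MB) exceeds the 8 MB L3 of an X5570.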
If you don't want the n=1,1000 loop to be executed 1000 times, simply leave it out.
gfortran is more likely to take such code literally than commercial compilers, for which there is a demand to "cheat" overly simplified benchmarks.
If you're looking for hints specific to gfortran, the gnu Fortran mailing list is the place.
Presumably, -O4 is treated the same as -O3 by the compilers you cite, but why take a chance on undocumented behavior?
why take a chance on undocumented behavior?
That is right. I will try this implementation in the main code and see whether performance really improves.
In fact, I was wondering whether I could get the same improvement in calculations on my local workstation (gfortran) and on the cluster (ifort). I prefer having one subroutine for all compilers rather than one specific to each compiler.
However, I consider the objective of this topic achieved, because the heavy calculations are only done on the cluster (ifort).
I would like to thank all of you for your help in this topic. It will allow us to perform larger simulations with the same amount of CPU hours, which are a rare resource here. :)
Ben
It would appear that the test 3 method (with SSE4.2 if using ifort) is the way to go. The effect on gfortran is benign (i.e. you are no worse off), and if gfortran later improves its vectorization of this kind of loop, you will get the additional performance with no further code change. Your build configuration options will have to vary from platform to platform and/or compiler to compiler.
Jim
Thank you for this reply.
gfortran is used only for non-MPI or low-MPI calculations (up to 8 cores), so performance there is not that important.
I will proceed as you said and keep one subroutine for all compilers.
The build configuration is already specific to each architecture we use; I just need to add the SSE4.2 option.
This improvement will allow me to simulate larger fields with our local clusters.
Once again, thank you for this idea of loop reorganisation. :)
Ben
Ben,
Look in the ISN Blogs section in the next few days/weeks; I will be submitting an article illustrating a high level of optimization. I've collected some impressive run data on a Core i7 920 but would like to find some time on a larger SMP NUMA system. If you know of someone with a few hours of free system time, I would appreciate a referral.
Jim Dempsey
Can you confirm that this is the right place to look?
http://software.intel.com/en-us/blogs/
Unfortunately, the only computer I administer is my own workstation, and it is much the same as your Core i7 920: a Xeon X5570. The only facilities I know of are CRIHAN and IDRIS, but you have to go through a lot of forms and paperwork to be allowed to use them. (Administration is our national favourite sport...)
I am looking forward to your article!
Ben