Hi,
I did some profiling using the -p option. From the results, I found that about 22% of the time is spent executing a particular function, phi_f(), given by
real(8) function phi_f(u1,u2,d1,d2)
real(8), intent(in) :: u1,u2,d1,d2
phi_f = (d1*u2+d2*u1)/(d1+d2)
end function phi_f
For example: phi_f(u(i,j),u(i+1,j),cu(i,j)%pd_E,cu(i+1,j)%pd_W)
Since d1,d2 are constants at each time step and only u1 and u2 change, I decided to simplify the function by changing it to
real(8) function phi_f_new(u1,u2,d1,d2,inv_d)
real(8), intent(in) :: u1,u2,d1,d2,inv_d
phi_f_new = (d1*u2+d2*u1)*inv_d
end function phi_f_new
where inv_d=1./(d1+d2).
For example: phi_f_new(u(i,j),u(i+1,j),cu(i,j)%pd_E,cu(i+1,j)%pd_W,cu(i,j)%inv_d_PE)
and then to
real(8) function phi_f_new2(u1,u2,dd1,dd2)
real(8), intent(in) :: u1,u2,dd1,dd2
phi_f_new2 = dd1*u2+dd2*u1
end function phi_f_new2
dd1 and dd2 are actually d1/(d1+d2) and d2/(d1+d2). They are computed initially and stored.
For example: phi_f_new2(u(i,j),u(i+1,j),cu(i,j)%ratio_Pe_PE,cu(i+1,j)%ratio_Pw_PW)
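For anyone following along, here is a quick sanity check that the three variants really are algebraically equivalent. It's written in Python rather than Fortran purely for convenience, and the names mirror the post; it says nothing about performance, only correctness of the transformation:

```python
# Python translations of the three Fortran variants from the post,
# just to confirm they compute the same weighted average.

def phi_f(u1, u2, d1, d2):
    # original: one divide on every call
    return (d1 * u2 + d2 * u1) / (d1 + d2)

def phi_f_new(u1, u2, d1, d2, inv_d):
    # inv_d = 1/(d1+d2), precomputed once per time step and stored
    return (d1 * u2 + d2 * u1) * inv_d

def phi_f_new2(u1, u2, dd1, dd2):
    # dd1 = d1/(d1+d2), dd2 = d2/(d1+d2), precomputed and stored
    return dd1 * u2 + dd2 * u1

u1, u2, d1, d2 = 2.0, 6.0, 1.0, 3.0
inv_d = 1.0 / (d1 + d2)
print(phi_f(u1, u2, d1, d2))                       # 3.0
print(phi_f_new(u1, u2, d1, d2, inv_d))            # 3.0
print(phi_f_new2(u1, u2, d1 * inv_d, d2 * inv_d))  # 3.0
```

Note the three results agree exactly here only because the chosen values round cleanly; in general they agree up to floating-point rounding error, since a divide and a multiply-by-reciprocal are not bit-identical operations.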
I thought that using the 2nd function (phi_f_new) would be faster than the 1st (phi_f), and that the 3rd function (phi_f_new2) would be faster than the 2nd. However, after profiling again, I found that phi_f_new is indeed faster than phi_f, but phi_f_new2 is slower than phi_f_new.
The operation count has been reduced, so it should be faster, shouldn't it?
It seems that the difference is smaller when I increase the no. of grid points, although phi_f_new2 is still slower.
I think tweaking the function will give some marginal improvement, but if you can inline it you may see a very dramatic speedup for the application.
ron
Thank you for your suggestion Ron!
However, I have already used -ip when compiling. Btw, my compiler options are:
-p -O3 -ip -r8 -w95 -c -static-libcxa
Thanks Steve!
I just did profiling using the new option -xP, since I'm using a C2D processor. There seems to be a slight speedup for some routines, although it's not really conclusive.
I then tried -ipo and there is a 20% speedup! Moreover, I found that the profiling output is quite different after using -ipo; I can't find the phi_f function anymore.
Since we're on the topic of optimization, what's the difference between -ip and -ipo? Why is the profiling output so different?
Another thing: after using -ipo, I found that using the original phi_f is even slightly faster than phi_f_new or phi_f_new2.
Any idea why simplifying the function as I mentioned earlier makes it slower instead? I still don't understand.
-ipo analyzes the entire application across all source files. It's as if you merged all the sources into one file and compiled that one file with -ip. -ipo provides much more information to the optimizer and allows it to make better inlining and vectorization decisions. I'd guess that with -ipo, phi_f is being completely inlined.
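To make the distinction concrete, here is a sketch of what the two build modes might look like; the source file names are placeholders, and the flags are the ifort options discussed above:

```sh
# -ip: interprocedural optimization within each source file separately;
# calls into other files (e.g. phi_f defined in another file) remain calls
ifort -O3 -ip -c solver.f90
ifort -O3 -ip -c fluxes.f90
ifort -O3 solver.o fluxes.o -o app

# -ipo: interprocedural optimization across ALL source files, as if they
# were merged into one; phi_f can now be inlined into loops in solver.f90
ifort -O3 -ipo solver.f90 fluxes.f90 -o app
```

This also explains the profiling change: once phi_f is fully inlined at its call sites, it no longer exists as a separate routine for the profiler to attribute time to.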
One last point on the code 'simplification': modern compilers are quite clever at doing all the things you were attempting to do by hand, such as reducing the instruction count and converting division to multiplication by a reciprocal. On top of that, advances in chip technology are also changing the rules: on the latest processors, a division can sometimes be almost as fast as a multiply by a reciprocal approximation. In the future, who knows, it might be as fast or faster.
In addition, the compiler can and does do things with register allocation of variables, streaming stores, vectorization, etc., things that would take a clever programmer a LONG time to do. Recently on our forums we had a customer who hand-coded assembly to try to outdo the compiler. It turned out the compiled code and his hand-tuned assembly were approximately equal. He was stunned, to say the least.
Like a lot of Fortran programmers, we cut our teeth in a time when we didn't trust compilers and did a LOT of modifications to the code to outguess the compiler and tune the code to a particular architecture. I remember rewriting a lot of key routines in DEC assembler to get the extra boost I needed in my application performance. The downside is obvious: those inheriting my old codes have probably cursed me to no end.
What I'm getting at: these days the best advice is to write your code in the most straightforward way, a way that reflects the algorithm. Modern compilers do a bang-up job on the optimization, often as good or better than you could do by hand. And even if you do some clever things today, a new chip advance may make your effort irrelevant. Also the code modifications often obscure the underlying algorithm and are counter to the goal of maintainable code.
ron