Intel® Fortran Compiler

Simplified function, but code slows down instead of speeding up

Wee_Beng_T_
Beginner
464 Views

Hi,

I profiled my code using the -p option. From the results, I found that about 22% of the time is spent executing a particular function, phi_f(), given by

real(8) function phi_f(u1,u2,d1,d2)

real(8), intent(in) :: u1,u2,d1,d2

phi_f = (d1*u2+d2*u1)/(d1+d2)

end function phi_f

For example: phi_f(u(i,j),u(i+1,j),cu(i,j)%pd_E,cu(i+1,j)%pd_W)

Since d1 and d2 are constant at each time step and only u1 and u2 change, I decided to simplify the function by changing it to

real(8) function phi_f_new(u1,u2,d1,d2,inv_d)

real(8), intent(in) :: u1,u2,d1,d2,inv_d

phi_f_new = (d1*u2+d2*u1)*inv_d

end function phi_f_new

where inv_d = 1./(d1+d2) is computed once per time step and stored.

For example: phi_f_new(u(i,j),u(i+1,j),cu(i,j)%pd_E,cu(i+1,j)%pd_W,cu(i,j)%inv_d_PE)

and then to

real(8) function phi_f_new2(u1,u2,dd1,dd2)

real(8), intent(in) :: u1,u2,dd1,dd2

phi_f_new2 = dd1*u2+dd2*u1

end function phi_f_new2

dd1 and dd2 are actually d1/(d1+d2) and d2/(d1+d2). They are computed initially and stored.

For example: phi_f_new2(u(i,j),u(i+1,j),cu(i,j)%ratio_Pe_PE,cu(i+1,j)%ratio_Pw_PW)
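For reference, here is a minimal self-contained sketch (with hypothetical values for u1, u2, d1, d2, not taken from the original code) checking that the three variants are algebraically equivalent, with inv_d, dd1, and dd2 precomputed the way described above:

```fortran
program check_phi
  implicit none
  real(8) :: u1, u2, d1, d2, inv_d, dd1, dd2
  real(8) :: r1, r2, r3

  u1 = 1.5d0; u2 = 2.5d0   ! hypothetical field values
  d1 = 0.3d0; d2 = 0.7d0   ! hypothetical grid distances

  ! precomputed once per time step in the real code
  inv_d = 1.d0/(d1 + d2)
  dd1   = d1*inv_d
  dd2   = d2*inv_d

  r1 = (d1*u2 + d2*u1)/(d1 + d2)   ! phi_f
  r2 = (d1*u2 + d2*u1)*inv_d       ! phi_f_new
  r3 = dd1*u2 + dd2*u1             ! phi_f_new2

  print *, r1, r2, r3
  if (abs(r1-r2) > 1.d-14 .or. abs(r1-r3) > 1.d-14) then
    error stop 'phi mismatch'
  end if
  print *, 'OK'
end program check_phi
```

All three compute the same distance-weighted interpolation; they differ only in how many operations happen per call versus once per time step.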

I expected the 2nd function (phi_f_new) to be faster than the 1st (phi_f), and the 3rd (phi_f_new2) to be faster than the 2nd. However, after profiling again, I found that phi_f_new is indeed faster than phi_f, but phi_f_new2 is slower than phi_f_new.

Since the operation count has been reduced, shouldn't it be faster?

The difference seems to shrink as I increase the number of grid points, although phi_f_new2 is still slower.

7 Replies
Ron_Green
Moderator
I'll ask the obvious question: if this trivial function is 22% of your runtime, why not get it to inline with -ip or -ipo?

I think tweaking the function will give some marginal improvement, but if you can inline it you may see a very dramatic speedup for the application.
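One way to give the compiler a chance to inline even without cross-file analysis (a sketch, assuming the caller can be moved into or next to the same source file; the demo values are hypothetical) is to make phi_f a module procedure, so its body is visible at the call site:

```fortran
module phi_mod
  implicit none
contains
  real(8) function phi_f(u1, u2, d1, d2)
    real(8), intent(in) :: u1, u2, d1, d2
    phi_f = (d1*u2 + d2*u1)/(d1 + d2)
  end function phi_f
end module phi_mod

program demo
  use phi_mod
  implicit none
  ! with equal distances this is just the midpoint average
  print *, phi_f(1.d0, 3.d0, 0.5d0, 0.5d0)   ! prints 2.0
end program demo
```

With the caller in the same source file (or with whole-program analysis across files), the compiler can see the body of phi_f and substitute it at each call site, eliminating the call overhead entirely.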

ron
Wee_Beng_T_
Beginner

Thank you for your suggestion Ron!

However, I already use -ip when compiling. Btw, my compiler options are:

-p -O3 -ip -r8 -w95 -c -static-libcxa

Steven_L_Intel1
Employee
Tried the -x option appropriate for your processor?
Wee_Beng_T_
Beginner

Thanks Steve!

I just profiled with the new option -xP, since I'm using a Core 2 Duo processor. There seems to be a slight improvement for some routines, although it's not really conclusive.

I then tried -ipo and got a 20% speedup! Moreover, the profiling output is now quite different after using -ipo: I can't find the phi_f function anymore.

While we're on the topic of optimization, what's the difference between -ip and -ipo? And why is the profiling output so different?

Another thing: after using -ipo, I found that the original phi_f is even slightly faster than phi_f_new or phi_f_new2.

Any idea why simplifying the function as I described earlier slows things down instead? I still don't understand.


Steven_L_Intel1
Employee
-ip does analysis within a single source file only, and in recent compilers a "light" form of -ip is done by default, so you may not see much benefit from -ip.

-ipo analyzes the entire application across all source files. It's as if you merged all the sources into one file and compiled that one file with -ip. -ipo provides much more information to the optimizer and allows it to make better inlining and vectorization decisions. I'd guess that with -ipo, phi_f is being completely inlined.
Ron_Green
Moderator
Glad to see that the inlining helped. I was pretty sure it would.

One last point on the code 'simplification': modern compilers are quite clever at doing all the things you were attempting by hand, such as reducing the instruction count and converting division to multiplication by a reciprocal. On top of that, advances in chip technology are changing the rules: on the latest processors, a division can sometimes be almost as fast as a multiply by a reciprocal approximation, and in the future it may be as fast or faster.

In addition, the compiler can and does do things with register allocation of variables, streaming stores, vectorization, etc. — things that would take a clever programmer a LONG time to do by hand. Recently on our forums we had a customer who hand-coded assembly to try to outdo the compiler. It turned out the compiled code and his hand-tuned assembly were approximately equal. He was stunned, to say the least.

Like a lot of Fortran programmers, we cut our teeth in a time when we didn't trust compilers and did a LOT of modifications to the code to outguess the compiler and tune the code to a particular architecture. I remember rewriting a lot of key routines in DEC assembler to get the extra boost I needed in my application performance. The downside is obvious: those inheriting my old codes have probably cursed me to no end.

What I'm getting at: these days the best advice is to write your code in the most straightforward way, a way that reflects the algorithm. Modern compilers do a bang-up job on the optimization, often as good or better than you could do by hand. And even if you do some clever things today, a new chip advance may make your effort irrelevant. Also the code modifications often obscure the underlying algorithm and are counter to the goal of maintainable code.

ron
Wee_Beng_T_
Beginner
Thank you for the enlightenment, Steve and Ron! I now have a better understanding of the compiler and its capabilities.