I have inherited an older HPC code with a lot of legacy solutions and work-arounds and I have been working for a while on updating the code. The code has been very slow, enough so that it has been almost unusable when problem sizes have grown, but I have been able to improve both the parallel and vector performance as well as to remove a few obvious things (like a**1.5 -> a*sqrt(a)).
I have used different profiling tools and a lot of internal standard tests and I have found out that a substantial part of the execution time is spent in a few core subroutines (impossible to vectorize or inline) that are called billions of times and in them calculating a number of cube roots now in the form
and also a few inverse cube roots of similar type.
My question is if there is a faster way to calculate the cube root in Fortran? I am aware of cbrt in C as well as the vector version in the MKL library, but unless I have to I would prefer (for really no good reason at all but gut feeling) not to get C involved unless necessary. What is the "common knowledge", to use cbrt from C, roll something from a numerical recipe (I have only found general Nth root) or just leave it be due to low gains? Precision is of course a factor but I can afford to loose a little precision, compared to other approximations done in the code.
The compiler already optimizes this to call a cube root routine. My advice is to not try to micromanage optimizations - let the compiler do it.
;;; subroutine foo (a,b) L1:: ;1.12 push rbp ;1.12 sub rsp, 32 ;1.12 mov rbp, rdx ;1.12 ;;; b = a**(1.0/3.0) movss xmm0, DWORD PTR [rcx] ;2.1 call cbrtf ;2.1
Andersson Per wrote:
.. My question is if there is a faster way to calculate the cube root in Fortran? ..
See an interesting discussion on this forum re: cube root calculations and recent changes with it in Intel Fortran compiler:
Thank you very much for your comments, and as I should have expected the compiler experts at Intel are miles ahead of me. I have written a small benchmark based on the discussion in the other thread I was recommended to look at and the results are clear. The compiler catches the cube root (and related operations like inverted cube roots) written in many different forms, and that operation is much faster than a general power operation with compiler options for fast math, and also much faster than calling cbrt from C through an interface. This forum rocks!