Solved: ippsAtan2 timing with 0 operands

Tim_Roberts · ‎10-28-2010

When either of the operands to ippsAtan2_32f is 0, the operation takes many times longer than when both operands are non-zero. On my AMD 64X2, it takes 3x as long (65 cycles per element, versus 22 cycles). On a Xeon X5680, it takes TEN TIMES as long (207 cycles versus 21).

I find this very odd, since the results in either case are constant (0 when X=0, pi/2 when Y=0). The atan2 function in Microsoft's C run-time library takes half the time when an operand is 0.

I'm going to try scanning through the vectors to special-case zero elements, but I'm dubious that is a net win. Anyone have any suggestions?
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Vladimir_Dudnik · ‎10-28-2010

Hi Tim,

What version of IPP do you use? Does the effect occur with all variants of the atan2 function (ippsAtan2_32f_A11, ippsAtan2_32f_A21 and ippsAtan2_32f_A24)?

Regards,
Vladimir

Tim_Roberts · ‎10-28-2010

I'm using IPP 6.1.

Good question regarding the precision. I was using the A11 variant, but I just checked the others. The A24 variant also has a penalty when the parameters are 0, but the penalty is smaller.

The A21 variant behaves differently. I don't see a penalty when an operand is exactly 0, but when both of the values are small (but non-zero), the 3x penalty is there.

If this were an iterative algorithm, I might expect that some combinations take longer to converge, but I thought this was a straight-line polynomial. Hence, my surprise. Could this be triggering overflow or underflow?

Tim Roberts

Andrey_G_Intel2 · ‎10-29-2010

Tim,

Which libraries are you using - IA32 or Intel64? Did you use emerged libs? If yes, did you use the ippInit function in your code?

Andrey

Nikita_A_Intel · ‎10-29-2010

Tim,
The algorithm for atan2 has a special code path for handling zeros. Different combinations of zero and non-zero arguments yield different special-case results, and they are all handled outside the main-path algorithm. This is specific to vector functions: we use SIMD instructions to gain maximum performance, but this means we have to apply the same algorithm to all inputs. That algorithm is branch-free by design (to avoid misprediction penalties), and we strive to make it applicable to the widest possible range of arguments. Still, making the algorithm uniform across very different cases has performance implications, so we choose to take the hit of a branch mispredict for special cases (e.g. zeros) rather than slow down all values in a uniform algorithm.
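To illustrate what "different special case results" means, here is a minimal sketch of the zero-operand cases as the C standard library defines them for atan2f(y, x). The helper name atan2_zero_case is hypothetical and this is not IPP's internal code; it only shows that the constant depends on which operand is zero and on the signs involved.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper (not part of IPP): if y or x is zero, store the
   constant value of atan2(y, x) in *out and return true. Otherwise
   return false and leave *out untouched. NaN inputs are not handled here. */
static bool atan2_zero_case(float y, float x, float *out)
{
    const float pi = 3.14159265358979323846f;

    if (isnan(y) || isnan(x))
        return false;               /* leave NaNs to the general path */

    if (y == 0.0f) {
        /* On the x axis: magnitude 0 for positive x (or +0),
           pi for negative x (or -0); the sign follows y. */
        *out = copysignf(signbit(x) ? pi : 0.0f, y);
        return true;
    }
    if (x == 0.0f) {
        /* On the y axis: +pi/2 for y > 0, -pi/2 for y < 0. */
        *out = copysignf(pi / 2.0f, y);
        return true;
    }
    return false;
}
```

So even though each individual result is a constant, choosing the right constant needs per-element tests, which is the kind of branching a uniform SIMD path tries to avoid.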

If you have a lot of zeros in your vector, you may consider a couple of options: a) filter them out before calling the vector function, or b) call a scalar function in a loop, e.g. atan2f from math.h (or mathimf.h if you are using the Intel Compiler).

Nikita