- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When either of the operands to ippsAtan2_32f is 0, the operation takes many times longer than it when both operands are non-zero. On my AMD 64X2, it takes 3x as long (65 cycles per element, versus 22 cycles). On a Xeon X5680, it takes TEN TIMES as long (207 cycles versus 21).
I find this very odd, since the results in either case are constant (0 when X=0, pi/2 when Y=0). The atan2 function in Microsoft's C run-time library takes half the time when an operand is 0.
I'm going to try scanning through the vectors to special-case zero elements, but I'm dubious that is a net win. Anyone have any suggestions?
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
I find this very odd, since the results in either case are constant (0 when X=0, pi/2 when Y=0). The atan2 function in Microsoft's C run-time library takes half the time when an operand is 0.
I'm going to try scanning through the vectors to special-case zero elements, but I'm dubious that is a net win. Anyone have any suggestions?
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
1 Solution
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tim,
The algorithm for atan2 has special code path for handling zeros. Different combinations of zero-nonzero arguments yield different special case results and they are all handled outside of the main path algorithm. This is vector function specific: we use SIMD commands to gain maximum performance, but this means we have to apply same algorithm to all inputs.This same algorithm is by design branch-free (to avoid misprediction penalties) and we strive to make it applicable for widest possible range of arguments. Still making this algorithm uniform for very different cases has performance implications. And we choose to take a hit of branch mispredict for subtle cases (e.g. zeros) versus slowing down all values in a uniform algorithm.
In case you have a lot of zeros in your vector you may consider couple opportunities: a) filter them out bevore calling a vector function b) call scalar function in a loop e.g. atan2f from math.h (or mathimf.h if you are using Intel Compiler).
Nikita
The algorithm for atan2 has special code path for handling zeros. Different combinations of zero-nonzero arguments yield different special case results and they are all handled outside of the main path algorithm. This is vector function specific: we use SIMD commands to gain maximum performance, but this means we have to apply same algorithm to all inputs.This same algorithm is by design branch-free (to avoid misprediction penalties) and we strive to make it applicable for widest possible range of arguments. Still making this algorithm uniform for very different cases has performance implications. And we choose to take a hit of branch mispredict for subtle cases (e.g. zeros) versus slowing down all values in a uniform algorithm.
In case you have a lot of zeros in your vector you may consider couple opportunities: a) filter them out bevore calling a vector function b) call scalar function in a loop e.g. atan2f from math.h (or mathimf.h if you are using Intel Compiler).
Nikita
Link Copied
4 Replies
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Tim,
what version of IPP do you use? Does that effect take place on all variants of atan2 function (ippsAtan2_32f_A11, ippsAtan2_32f_A21 and ippsAtan2_32f_A24)?
Regards,
Vladimir
what version of IPP do you use? Does that effect take place on all variants of atan2 function (ippsAtan2_32f_A11, ippsAtan2_32f_A21 and ippsAtan2_32f_A24)?
Regards,
Vladimir
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm using IPP 6.1.
Good question regarding the precision. I was using the A11 variant, but I just checked the others. The A24 variant also has a penalty when the parameters are 0, but the penalty is smaller.
The A21 variant behaves differently. I don't see a penalty when it is exactly 0, but both of the values are small (but non-zero), the 3x penalty is there.
If this were an iterative algorithm, I might expect that some combinations take longer to converge, but I thought this was a straight-line polynomial. Hence, my surprise. Could this be triggering overflow or underflow?
Tim Roberts
Good question regarding the precision. I was using the A11 variant, but I just checked the others. The A24 variant also has a penalty when the parameters are 0, but the penalty is smaller.
The A21 variant behaves differently. I don't see a penalty when it is exactly 0, but both of the values are small (but non-zero), the 3x penalty is there.
If this were an iterative algorithm, I might expect that some combinations take longer to converge, but I thought this was a straight-line polynomial. Hence, my surprise. Could this be triggering overflow or underflow?
Tim Roberts
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tim,
which libraries are you using - IA32 or Intel64? Did you use emerged libs? If yes, did you use ippInit function in your code?
Andrey
which libraries are you using - IA32 or Intel64? Did you use emerged libs? If yes, did you use ippInit function in your code?
Andrey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tim,
The algorithm for atan2 has special code path for handling zeros. Different combinations of zero-nonzero arguments yield different special case results and they are all handled outside of the main path algorithm. This is vector function specific: we use SIMD commands to gain maximum performance, but this means we have to apply same algorithm to all inputs.This same algorithm is by design branch-free (to avoid misprediction penalties) and we strive to make it applicable for widest possible range of arguments. Still making this algorithm uniform for very different cases has performance implications. And we choose to take a hit of branch mispredict for subtle cases (e.g. zeros) versus slowing down all values in a uniform algorithm.
In case you have a lot of zeros in your vector you may consider couple opportunities: a) filter them out bevore calling a vector function b) call scalar function in a loop e.g. atan2f from math.h (or mathimf.h if you are using Intel Compiler).
Nikita
The algorithm for atan2 has special code path for handling zeros. Different combinations of zero-nonzero arguments yield different special case results and they are all handled outside of the main path algorithm. This is vector function specific: we use SIMD commands to gain maximum performance, but this means we have to apply same algorithm to all inputs.This same algorithm is by design branch-free (to avoid misprediction penalties) and we strive to make it applicable for widest possible range of arguments. Still making this algorithm uniform for very different cases has performance implications. And we choose to take a hit of branch mispredict for subtle cases (e.g. zeros) versus slowing down all values in a uniform algorithm.
In case you have a lot of zeros in your vector you may consider couple opportunities: a) filter them out bevore calling a vector function b) call scalar function in a loop e.g. atan2f from math.h (or mathimf.h if you are using Intel Compiler).
Nikita
Reply
Topic Options
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page