Solved: Slow IppsSin_64f_Axx

gol · ‎10-15-2009

I've noticed that IppsSin_64f_Axx *massively* decreases in performance as input values increase. I don't much understand why, I would guess that's in order to keeptherequiredaccuracy, even though I don't understand why there would be a loss if the phases were normally pre-wraped.

So this leads to my next question: is there any IPP function that can wrap phases, or most likely compute the fractional part of a buffer?

Or is there any known trick to quickly compute fractionals of double floats? By binary manipulating them maybe?

Right now I'm using CVTTPD2DQ / CVTDQ2PD/ SUBPD. Not very slow but if I can get anything better..

Thanks

Sergey_M_Intel2 · ‎10-15-2009

Hello,

Unfortunately this is the nature of trigonometric functions. Thebigger argument is thegreater intermediate precision is required to provide the result within guaranteed error bounds. Good summary of the problem can be found in open literature, e.g.David Defour, et al (2001) A new range reduction algorithm. http://www.imada.sdu.dk/~kornerup/papers/RR2.pdf

If you do not care about accuracy on large arguments you might want to use IppsSin_64f_A26 function.

Accuracy loss is the only tradeoffyou canmake here. There is no magic: simple tricks will not give you accuracy either. So probably better is to use A26 variant of IPP sine.

Hope this helps,
Regards,
Sergey Maidanov

View solution in original post

Sergey_M_Intel2 · ‎10-15-2009

Hello,

Unfortunately this is the nature of trigonometric functions. Thebigger argument is thegreater intermediate precision is required to provide the result within guaranteed error bounds. Good summary of the problem can be found in open literature, e.g.David Defour, et al (2001) A new range reduction algorithm. http://www.imada.sdu.dk/~kornerup/papers/RR2.pdf

If you do not care about accuracy on large arguments you might want to use IppsSin_64f_A26 function.

Accuracy loss is the only tradeoffyou canmake here. There is no magic: simple tricks will not give you accuracy either. So probably better is to use A26 variant of IPP sine.

Hope this helps,
Regards,
Sergey Maidanov

gol · ‎10-15-2009

I see, I didn't know about IppsSin_64f_A26, I was still browsing the IPP 6.0 help.

Now I'm wrapping the phases to 0..2Pi in doubles(& I'm happy with the accuracy),convert to singles, & use IppsSin_32f_A24 on that.
Reading your explanation that it's the subtraction that generates the loss, I assume this is even helping the _32f_A24 version which too has to wrap phases the same way? I mean, I guess this makes the algo already stop early at "x<=8".

Thanks for the explanation.

Sergey_M_Intel2 · ‎10-16-2009

Quoting - gol

Now I'm wrapping the phases to 0..2Pi in doubles(& I'm happy with the accuracy),convert to singles, & use IppsSin_32f_A24 on that.

Yes, this is what can be done to avoid large args. You might improve a little bit by replacing

Phase wrapping in DP->Convert to SP->Sin_32f_A24->Convert to DP

with

Phase wrapping in DP->Sin_64f_A26

It will save some cycles on converts DP->SP-DP. Also DP A26 sine could be a bit faster than respective SP A24 sine yet somewhat more accurate on 0..2*Pi. Let me knowhow/if it works for you.

If your app is not really sensitive to the accuracy loss due to "naive wrapping" to 0...2*Pi then SP A21 or A11 might work to you. (Converting to SP makes more sense if you need even less accuracy that A24 provides). In this case you may want to try A21 (still close to full precision) or even A11 (half precision) - nice performance gains might be observed.

Quoting - gol

Reading your explanation that it's the subtraction that generates the loss, I assume this is even helping the _32f_A24 version which too has to wrap phases the same way?

Yes, one would face the same issue in SP trigonometric functions as arguments grow in magnitude. :-( And yes, wrapping to 0..2*Pi is the way to save performance.

Regards,
Sergey
P.S.: Engineering team has taken your case for consideration. Specifically, your reduction to 0..2*Pi significantly impacts the overall accuracy. Engineers will consider doing this for you inside A26 sine in such a way that you don't need to do wrapping on your side. If it is feasible to implement, then a) you get more performance by avoiding wrapping on your side and b) you get greater accuracy

gol · ‎10-16-2009

>>>Let me know how/if it works for you.

Worked well. Not faster but as fast, so it's worth the slightly better accuracy.

I'm using it to refresh coefficients in a bulk sine generator, using that trigonometry identity that Tone_Direct is most likely using.
It's hard to decide which accuracy is ok here, because I wanted my sine gen to work on single floats for speed, but that algo quickly loses accuracy using single precision, thus the coefficients need to be refreshed periodically, using real sines this time, hence the call to IppsSin. So the accuracy matters since I'd rather not refresh the generator with inaccurate coefficients. However if the period between refreshes is too small (&the loss seems todepend on the speed of the tone), and if the real sines take too long to compute, then the coefficient refreshing starts chewing more CPU than the generator itself.

Could be interesting to see this in IPP, a 'bulk sinewave generator', kind of an FFT, but not restricted to just harmonics, and not forced to process power of 2 chunks. I got mine really fast, however I think that it'd be hard to ensure a minimum of accuracy for all cases here. But it's something pretty useful for additive audio synthesis.

>>>Engineers will consider doing this for you inside A26 sine in such a way that you don't need to do wrapping on your side.

Sounds cool