Re: Slow running ippsSinCos routine

atd · ‎06-07-2005

Hi,

I was hoping to get a little help on why the ippsSinCos_32f_A11 routine is behaving strangely in my code. I am developing an application for astronomical data processing which relies heavily on vector manipulation, mainly cross-multiplies, FFTs and sin/cos.

I am profiling the code with callgrind and kcachegrind, and am finding that in one method where I call ippsSinCos_32f_A11 twice, one execution takes roughly 2% of total CPU time while the other takes nearly 30% of total CPU time. The second call operates on a vector twice as long as the first, but I can't see why this should increase running time by a factor of 15. Indeed, a FFT_RtoCCS which operates on the outputs of the second SinCos routine (ie operates twice) takes only 16% of total execution time, or 8% per vector, more than 3 times faster than the SinCos.

Can anyone offer any ideas of things to check which could be causing this routine to run so slowly?

Thanks in advance,
Adam

atd · ‎06-07-2005

Some more information: In the profiler, I see that the function ippsSinCos_32f_A11 has been called approximately 9400 times, which is the correct number for the problem I am running it on. However, looking at the function tree, the ippsSinCos function has made 1.6 million calls to vmlsSinCos_SC_A11, which in turn calls _vmlsSinCos_SC, which calls _dSinCos. Is it right that calls to SinCos should explode by a factor of ~170 in calls to the underlying functions?

Adam

Vladimir_Dudnik · ‎06-07-2005

Hi,

Could you please add more details about which version of IPP you use and the platform where you are running your code?

Personally, my first thought is that if you have exceeded the processor's L2 cache size, there can be a visible slowdown because of slow memory access.

Regards,

Vladimir

Sergey_M_Intel2 · ‎06-07-2005

Hi Adam,

We know the root of your problem. To avoid accuracy loss, the algorithms for sine and cosine functions must deal carefully with very large arguments. This means that algorithms for large arguments are usually much more complicated, and hence much slower, than algorithms for small arguments. One consequence is that each large argument is processed in a special path (namely in the vmlsSinCos_SC_A11 routine).

Specifically, for the ippsSinCos_32f_A11 routine, all arguments with |x| >= 10000 are processed in vmlsSinCos_SC_A11.

The only practical recommendation we have is to check whether your algorithm really needs to deal with such large arguments (from a physical point of view, very large arguments for sine and cosine are rarely used). Please check whether you can somehow reduce the input arguments to a moderate range (e.g. using the periodicity property). This is the only way to make SinCos really fast.

It is impossible to dramatically improve the "large arguments" path in SinCos. Unfortunately, there is no magic: large arguments in sine and cosine are expensive if there are accuracy requirements. Specifically, the ipps*_32f_A11 functions provide ~11 bits of accuracy over the _whole_ function domain.

If you have any questions please don't hesitate to ask. We will answer with pleasure.

Regards,

Sergey

atd · ‎06-08-2005

Thank you very much Sergey!

I did not realise, but my code was generating large arguments for the sincos routine. I have modified it so that the arguments are bounded between 0 and 2*pi, using ippsConvert routines to cast the floats to integers. I have included my old code and new code at the bottom of this message, with the error checking removed for conciseness. This has cut overall program execution time by ~25-30% which is very pleasing. The sincos routine, which was taking 30% of the original execution time, now takes less than 8% of the reduced execution time, even with the extra calls taken into account.

Is there any other way you can suggest which may perform faster than the code I have shown below?

I do have one suggestion/request: A function which took a floating point vector and returned the fractional component of the vector would be very useful, as would be a function which returned the remainders of an element by element vector division. At the moment I have worked around this using ippsConvert with ippRndZero to calculate the integer component, then converted this back to a float and subtracted from the original vector.

Anyway, thank you again for your timely help!

=======ORIGINAL CODE======

variables: fringedelayarray is a vector with elements bounded between 0.0 and 1.0
loFreqs = a large number

ippsMulC_32f(fringedelayarray, IPP_2PI*loFreqs, rotateargument, twicenumchannels);
ippsSinCos_32f_A11(rotateargument, sinrotated, cosrotated, twicenumchannels);

=======NEW CODE============

variables: same, with the addition of the hopefully obvious introtateargument and floatrotateargument

ippsMulC_32f(fringedelayarray, loFreqs, rotateargument, twicenumchannels);
//chop off the non-fractional part
ippsConvert_32f32s_Sfs(rotateargument, introtateargument, twicenumchannels, ippRndZero, 0);
ippsConvert_32s32f(introtateargument, floatrotateargument, twicenumchannels);
ippsSub_32f_I(floatrotateargument, rotateargument, twicenumchannels);
ippsMulC_32f_I(IPP_2PI, rotateargument, twicenumchannels);
ippsSinCos_32f_A11(rotateargument, sinrotated, cosrotated, twicenumchannels);

Sergey_M_Intel2 · ‎06-08-2005

Hi Adam,

Your modified code looks good. If you are compiling the application with the Intel compiler (ver. 8.0 or higher), then I have an idea for how to improve the performance further.

/* Example: Calculate the vector of fractional parts */

#include <math.h>

void foo(float * x, float * r, int n)
{
    int i;
    for (i = 0; i < n; i++)
    {
        /* fractional part: subtract the largest integer not exceeding x[i] */
        r[i] = x[i] - floor(x[i]);
    }
}

The latest Intel compilers are smart enough to "vectorize" such loops, i.e. all computations in the loop will be done using SSE/SSE2 instructions. I would expect the compiler to generate faster code than your modified version (mainly because there will be no int<->float conversion). Anyway, I would not expect a dramatic performance increase, since it is likely that this loop takes just a small fraction of the overall application's time.

Let me know if it works for you.

With respect to your request, it is well taken. We are considering adding several relevant functions, such as Rem (remainder of a/b), DivRem (quotient and remainder of a/b), Floor, Ceil, Round. Hopefully, these functions will add some flexibility as well as bring some performance benefits.

Regards,

Sergey

P.S. I've attached the test case. It includes the generated asm code, so you can see how smart the compiler is. The file report.txt is the compiler's message output. If the loop is vectorized, you should see the corresponding message.

atd · ‎06-09-2005

Hi Sergey,

I am currently using gcc (with -O2) to compile my code - I tried with the suggested loop and code ran slower, so obviously gcc is not vectorising the loop. I have access to icc v8, but when I changed compilers I ran into a whole mess of errors. I am developing in KDevelop, and I think this is probably KDevelop's fault, but I couldn't track down the problem straight away. As you say, I suspect the performance increase to be gained is very small as the ippsConvert routines run very quickly anyway.

Thanks again,
Adam