Over the years I've been moving my (audio-related) code to IPP more and more, not only for the speedups but also to have less assembler code to maintain and update for each new processor.
Sometimes IPP's functions were slower than my existing code, sometimes faster. I never really brought it up because I was on an AMD CPU and, of course, didn't expect IPP to work best there.
Anyway, now that I have an Intel quad-core, I was going to replace my IIR biquad processing with IPP's, but ippsIIR64fc_32fc_I seems to be over 2x slower than mine (which also works with double coefficients and memory).
My code is all FPU (was as fast as SSE2 scalar double on my old AMD) for mono, and SSE2 packed double for stereo processing.
ippsIIR64f_32f_I turned out to be faster than my FPU code (which I quickly fixed, and it now almost reaches the speed of the IPP one).
But ippsIIR64fc_32fc_I looks over 2x slower than ippsIIR64f_32f_I, which suggests that all it does is de-interleave, process twice, and re-interleave. That's too bad, because there's not much room for parallelism in IIR processing, except when you process parallel streams (interleaved stereo here; I know 'complex' could mean something else, but I suppose _32fc can really apply to interleaved stereo?).
And indeed, my stereo function using packed doubles is almost as fast as the mono one using scalar instructions (only 10% slower than ippsIIR64f_32f_I).
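To illustrate the point about parallel streams, here is a minimal sketch in plain C-compatible code, not the actual SSE2 routine described above. The names `StereoBiquad` and `stereo_biquad_process` are hypothetical, and a transposed direct form II biquad with a0 normalized to 1 is assumed. The two-lane inner loop performs identical arithmetic on left and right, which is the part that would map onto one packed-double operation per step.

```cpp
#include <stddef.h>

/* Hypothetical types and names, for illustration only. */
typedef struct {
    double b0, b1, b2, a1, a2;  /* shared coefficients, a0 assumed to be 1 */
    double z1[2], z2[2];        /* per-channel state: [0] = left, [1] = right */
} StereoBiquad;

/* Filters interleaved L/R float samples in place, with double precision
   coefficients and state (transposed direct form II). */
static void stereo_biquad_process(StereoBiquad *f, float *lr, size_t frames)
{
    for (size_t n = 0; n < frames; ++n) {
        /* Both lanes run the same arithmetic on independent data. */
        for (int ch = 0; ch < 2; ++ch) {
            double in  = lr[2 * n + ch];
            double out = f->z1[ch] + f->b0 * in;
            f->z1[ch]  = f->z2[ch] + f->b1 * in - f->a1 * out;
            f->z2[ch]  = f->b2 * in - f->a2 * out;
            lr[2 * n + ch] = (float)out;
        }
    }
}
```

The serial dependency runs from one frame to the next, so the only independent work within a frame is across the channels.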
Is ippsIIR64fc_32fc_I really mapped to de-interleaving/processing/re-interleaving in the latest IPP?
Also, sometimes I wonder if I'm the last one still processing interleaved stereo audio and whether anyone else still cares about those formats. But IMHO it's pretty nice for SIMD processing.
ippsIIR64fc_32fc_I is a function that knows nothing about audio processing, interleaving, or de-interleaving.
It is a genuinely complex IIR function that works with complex numbers (both coefficients and source data). One output point is calculated as follows:
[cpp]
inp.re = src.re;
inp.im = src.im;
for( d = 0, t = 0; d < numBq; d += 2, t += 5 )
{
    out.re = pDly[d+0].re + MUL_RE( pTaps[t+0], inp );
    out.im = pDly[d+0].im + MUL_IM( pTaps[t+0], inp );
    pDly[d+0].re = pDly[d+1].re - MUL_RE( pTaps[t+3], out ) + MUL_RE( pTaps[t+1], inp );
    pDly[d+0].im = pDly[d+1].im - MUL_IM( pTaps[t+3], out ) + MUL_IM( pTaps[t+1], inp );
    pDly[d+1].re = -MUL_RE( pTaps[t+4], out ) + MUL_RE( pTaps[t+2], inp );
    pDly[d+1].im = -MUL_IM( pTaps[t+4], out ) + MUL_IM( pTaps[t+2], inp );
    inp.re = out.re;
    inp.im = out.im;
}
(*pDstVal).re = (Ipp32f)out.re;
(*pDstVal).im = (Ipp32f)out.im;
[/cpp]
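The per-point computation above can be sketched as a self-contained single-biquad step. This is an illustration under assumptions, not IPP source: `Cplx64` and `cplx_biquad_step` are hypothetical names, and `MUL_RE`/`MUL_IM` are assumed to be the real and imaginary parts of an ordinary complex product, since the post does not show their definitions.

```cpp
/* Hypothetical stand-in for a double precision complex type. */
typedef struct { double re, im; } Cplx64;

/* Assumed definitions: real and imaginary parts of the complex product a*b. */
#define MUL_RE(a, b) ((a).re * (b).re - (a).im * (b).im)
#define MUL_IM(a, b) ((a).re * (b).im + (a).im * (b).re)

/* One step of a single complex biquad.
   taps = {b0, b1, b2, a1, a2}, dly = two complex state values. */
static Cplx64 cplx_biquad_step(const Cplx64 taps[5], Cplx64 dly[2], Cplx64 inp)
{
    Cplx64 out;
    out.re = dly[0].re + MUL_RE(taps[0], inp);
    out.im = dly[0].im + MUL_IM(taps[0], inp);
    dly[0].re = dly[1].re - MUL_RE(taps[3], out) + MUL_RE(taps[1], inp);
    dly[0].im = dly[1].im - MUL_IM(taps[3], out) + MUL_IM(taps[1], inp);
    dly[1].re = -MUL_RE(taps[4], out) + MUL_RE(taps[2], inp);
    dly[1].im = -MUL_IM(taps[4], out) + MUL_IM(taps[2], inp);
    return out;
}
```

With purely real taps (every `.im` equal to 0), the `re` and `im` paths never mix, so interleaved stereo does pass through as two independent filters. Under the assumed macros, though, each complex product still costs four multiplies where two would suffice for real coefficients, which may account for part of the slowdown reported above.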
IPP provides multi-channel IIRs (especially for audio processing), but they are not intended for interleaved data:
[cpp]
/* /////////////////////////////////////////////////////////////////////////////
//  Names:      ippsIIR_32f_P, ippsIIR64f_32s_P
//  Purpose:    IIR filter for multi-channel data. Vector filtering.
//  Parameters:
//      ppSrc        - pointer to array of pointers to source vectors
//      ppDst        - pointer to array of pointers to destination vectors
//      ppSrcDst     - pointer to array of source/destination vectors in in-place ops
//      len          - length of the vectors
//      nChannels    - number of processing channels
//      ppState      - pointer to array of filter contexts
//  Return:
//      ippStsContextMatchErr - wrong context identifier
//      ippStsNullPtrErr      - pointer(s) to the data is NULL
//      ippStsSizeErr         - length of the vectors <= 0
//      ippStsChannelErr      - number of processing channels <= 0
//      ippStsNoErr           - otherwise
//
*/
IPPAPI( IppStatus, ippsIIR_32f_P,( const Ipp32f **ppSrc, Ipp32f **ppDst, int len,
        int nChannels, IppsIIRState_32f **ppState ))
IPPAPI( IppStatus, ippsIIR_32f_IP,( Ipp32f **ppSrcDst, int len,
        int nChannels, IppsIIRState_32f **ppState ))
IPPAPI(IppStatus, ippsIIR64f_32s_PSfs, (const Ipp32s **ppSrc, Ipp32s **ppDst, int len,
        int nChannels, IppsIIRState64f_32s **ppState, int *pScaleFactor))
IPPAPI(IppStatus, ippsIIR64f_32s_IPSfs, (Ipp32s **ppSrcDst, int len,
        int nChannels, IppsIIRState64f_32s **ppState, int *pScaleFactor))
[/cpp]
For de-interleaving and re-interleaving, the best choices are CplxToReal and RealToCplx, which are highly optimized:
[cpp]
/* /////////////////////////////////////////////////////////////////////////////
//  Name:       ippsCplxToReal
//  Purpose:    form the real and imaginary parts of the input complex vector
//  Parameters:
//      pSrc     pointer to the input complex vector
//      pDstRe   pointer to output vector to store the real part
//      pDstIm   pointer to output vector to store the imaginary part
//      len      length of the vectors, number of items
//  Return:
//      ippStsNullPtrErr   pointer(s) to the data is NULL
//      ippStsSizeErr      length of the vectors is less or equal zero
//      ippStsNoErr        otherwise
*/
IPPAPI(IppStatus, ippsCplxToReal_64fc,( const Ipp64fc* pSrc, Ipp64f* pDstRe, Ipp64f* pDstIm, int len ))
IPPAPI(IppStatus, ippsCplxToReal_32fc,( const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len ))
IPPAPI(IppStatus, ippsCplxToReal_16sc,( const Ipp16sc* pSrc, Ipp16s* pDstRe, Ipp16s* pDstIm, int len ))

/* /////////////////////////////////////////////////////////////////////////////
//  Name:       ippsRealToCplx
//  Purpose:    form complex vector from the real and imaginary components
//  Parameters:
//      pSrcRe   pointer to the input vector with real part, may be NULL
//      pSrcIm   pointer to the input vector with imaginary part, may be NULL
//      pDst     pointer to the output complex vector
//      len      length of the vectors
//  Return:
//      ippStsNullPtrErr   pointer to the destination data is NULL
//      ippStsSizeErr      length of the vectors is less or equal zero
//      ippStsNoErr        otherwise
//
//  Notes:      one of the two input pointers may be NULL. In this case
//              the corresponding values of the output complex elements is 0
*/
IPPAPI(IppStatus, ippsRealToCplx_64f,( const Ipp64f* pSrcRe, const Ipp64f* pSrcIm, Ipp64fc* pDst, int len ))
IPPAPI(IppStatus, ippsRealToCplx_32f,( const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len ))
IPPAPI(IppStatus, ippsRealToCplx_16s,( const Ipp16s* pSrcRe, const Ipp16s* pSrcIm, Ipp16sc* pDst, int len ))
[/cpp]
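For readers without the headers at hand, the documented behaviour of these two functions can be sketched in portable code. This is only a reference illustration of the semantics described in the comments above, not the optimized IPP implementation. `Cplx32`, `split_cplx`, `merge_cplx`, and the integer return codes are hypothetical stand-ins and are not IPP status values.

```cpp
#include <stddef.h>

/* Hypothetical stand-in for a single precision complex (or L/R frame) type. */
typedef struct { float re, im; } Cplx32;

/* Splits interleaved pairs into two planar vectors.
   Returns 0 on success, -1 on a NULL pointer, -2 if len <= 0. */
static int split_cplx(const Cplx32 *src, float *dst_re, float *dst_im, int len)
{
    if (!src || !dst_re || !dst_im) return -1;
    if (len <= 0) return -2;
    for (int i = 0; i < len; ++i) {
        dst_re[i] = src[i].re;
        dst_im[i] = src[i].im;
    }
    return 0;
}

/* Merges two planar vectors into interleaved pairs. Either source may be
   NULL, in which case that component of every output element is 0. */
static int merge_cplx(const float *src_re, const float *src_im, Cplx32 *dst, int len)
{
    if (!dst) return -1;
    if (len <= 0) return -2;
    for (int i = 0; i < len; ++i) {
        dst[i].re = src_re ? src_re[i] : 0.0f;
        dst[i].im = src_im ? src_im[i] : 0.0f;
    }
    return 0;
}
```

For interleaved stereo, `re` plays the role of the left channel and `im` the right, so a split, two planar filter passes, and a merge give the same result as filtering the interleaved buffer directly, at the cost of two extra passes over the data.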