Biquad stuff kinda slow for ('complex') stereo

gol · 04-21-2009

Over the years I've been moving more and more of my (audio-related) code to IPP, not only for the speedups but also to have less assembler code to maintain and update for each new processor.

Sometimes IPP's functions were slower than my existing code, sometimes faster. I never really pointed it out because I was using an AMD, and of course I didn't expect IPP to work best on AMD processors.

Anyway, now that I have an Intel quad, I was going to replace my IIR biquad processing with IPP's, but IppsIIR64fc_32fc_I seems to be over 2x slower than mine (which also works with double coefficients/memory).

My code is all FPU for mono (it was as fast as SSE2 scalar double on my old AMD), and SSE2 packed double for stereo processing.
IppsIIR64f_32f_I turned out to be faster than my FPU code (which I quickly fixed, almost reaching the speed of the IPP one).
But IppsIIR64fc_32fc_I looks over 2x slower than IppsIIR64f_32f_I, which tells me that it is probably just de-interlacing, processing twice, and interlacing. I think that's too bad, because there's not much room for parallelism in IIR processing, except when you process parallel streams (interlaced stereo here; I know 'complex' could be something else, but I suppose _32fc can really apply to interlaced stereo here?).
And indeed, my stereo function using packed doubles is almost as fast as the mono one using scalar instructions (only 10% slower than IppsIIR64f_32f_I).

Is IppsIIR64fc_32fc_I really mapping to deinterlacing/processing/interlacing in the latest IPP?

Also, sometimes I wonder if I'm the last one still processing interlaced stereo audio & that maybe no one else cares for those formats. But IMHO it's pretty nice for SIMD processing.
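
Editor's sketch of the packed-double idea described in this post, not gol's actual code: both channels of an interleaved L/R buffer sit in the two lanes of one SSE2 register, so one pass through the recurrence filters both. The type and function names (`BiquadTaps`, `StereoDelay`, `biquad_stereo_sse2`, `biquad_channel_scalar`) are hypothetical, and the transposed direct form II structure and the sign convention for `a1`/`a2` are assumptions borrowed from the pseudocode later in the thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical types for this sketch; these are not IPP structures. */
typedef struct { double b0, b1, b2, a1, a2; } BiquadTaps;
typedef struct { double d0[2], d1[2]; } StereoDelay;  /* [0]=left, [1]=right */

/* In-place biquad over interleaved stereo floats (L R L R ...),
   with both channels processed in one packed-double register. */
void biquad_stereo_sse2(float *io, size_t frames,
                        const BiquadTaps *t, StereoDelay *s)
{
    const __m128d b0 = _mm_set1_pd(t->b0), b1 = _mm_set1_pd(t->b1),
                  b2 = _mm_set1_pd(t->b2), a1 = _mm_set1_pd(t->a1),
                  a2 = _mm_set1_pd(t->a2);
    __m128d d0 = _mm_loadu_pd(s->d0), d1 = _mm_loadu_pd(s->d1);

    for (size_t i = 0; i < frames; ++i) {
        float *p = io + 2 * i;
        /* Load one L/R pair (64 bits) and widen both floats to double. */
        __m128d in  = _mm_cvtps_pd(_mm_loadl_pi(_mm_setzero_ps(),
                                                (const __m64 *)p));
        __m128d out = _mm_add_pd(d0, _mm_mul_pd(b0, in));
        d0 = _mm_add_pd(_mm_sub_pd(d1, _mm_mul_pd(a1, out)),
                        _mm_mul_pd(b1, in));
        d1 = _mm_sub_pd(_mm_mul_pd(b2, in), _mm_mul_pd(a2, out));
        /* Narrow both results back to float and store the pair. */
        _mm_storel_pi((__m64 *)p, _mm_cvtpd_ps(out));
    }
    _mm_storeu_pd(s->d0, d0);
    _mm_storeu_pd(s->d1, d1);
}

/* Scalar reference: filters one channel of the same interleaved buffer. */
void biquad_channel_scalar(float *io, size_t frames, size_t channel,
                           const BiquadTaps *t, double dly[2])
{
    for (size_t i = 0; i < frames; ++i) {
        double in  = io[2 * i + channel];
        double out = dly[0] + t->b0 * in;
        dly[0] = dly[1] - t->a1 * out + t->b1 * in;
        dly[1] = t->b2 * in - t->a2 * out;
        io[2 * i + channel] = (float)out;
    }
}
```

The serial dependency between samples is unchanged; the only gain is that the second channel rides along in the otherwise idle upper lane, which is consistent with the stereo version costing roughly the same as a mono scalar pass.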

Vladimir_Dudnik · 04-21-2009

Hello,

Thanks for pointing that out.

Here are comments from our expert:

IppsIIR64fc_32fc_I is a function that knows nothing about audio processing, interlacing and de-interlacing.
It is a genuine complex IIR function that works with complex numbers (coefficients and source data).

One point is calculated in the following way:

[cpp]
    inp.re  = src.re;
    inp.im = src.im;
    
    for( d = 0, t = 0; d < numBq; d += 2, t += 5 )
    {
        out.re  = pDly[d+0].re + MUL_RE( pTaps[t+0], inp );
        out.im = pDly[d+0].im + MUL_IM( pTaps[t+0], inp );

        pDly[d+0].re  = pDly[d+1].re - MUL_RE( pTaps[t+3], out ) + MUL_RE( pTaps[t+1], inp );
        pDly[d+0].im = pDly[d+1].im - MUL_IM( pTaps[t+3], out ) + MUL_IM( pTaps[t+1], inp );
        pDly[d+1].re  = -MUL_RE( pTaps[t+4], out ) + MUL_RE( pTaps[t+2], inp );
        pDly[d+1].im = -MUL_IM( pTaps[t+4], out ) + MUL_IM( pTaps[t+2], inp );
        inp.re  = out.re;
        inp.im = out.im;
    }
    (*pDstVal).re = (Ipp32f)out.re;
    (*pDstVal).im = (Ipp32f)out.im;
[/cpp]

It is highly optimized with SSE instructions.

Could you please specify which version of IPP you use and which OS you run your test case on? It would also be nice if you could provide us with the test conditions (src, srcLen, taps, numBq and especially dst, to see the calculation results you get with your code), so that we can investigate what might be the reason for the performance issue you mention.

Regards,
Vladimir

gol · 04-21-2009

I guess it's just me who doesn't understand the function then.

Given the lack of interlaced signal processing support (which, again, is too bad because it's pretty much what multimedia instructions are good for), I've always used IPP functions on complex numbers to process interlaced stereo. For most of the basic functions it does work (reading more about complex numbers, I still don't understand their use nor the point of IIR filtering them, but I see that addition & multiplication by a constant work independently on each part, so I guess that's why).

So I thought that this too would filter 2 interlaced signals independently. I even tested it and it looked so (that's what's weird). But from the code it doesn't look like that's the case ("MUL_RE( pTaps[t+3], out )").

OK, so that function won't help me (& sorry for the mistake). But again, it's too bad, because to use IPP I would have to deinterlace, process the 2 channels, and interlace.
This means that someone who has to process something very serial like IIR filtering on up to 2 or 4 signals will end up using multithreading (at a cost), while multimedia instructions can help (& really do here) but aren't used. In things like vocoders or graphic EQs that involve lots of parallel filtering, this could give a 2x speedup.
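
Editor's sketch of why the test above could have "looked so": in the recurrence from the expert's pseudocode, a tap whose imaginary part is zero multiplies the real and imaginary parts separately, so the two parts are filtered as independent channels; as soon as any tap has a non-zero imaginary part, the parts mix. The code below is a plain-C model of that recurrence for a single biquad, not IPP's implementation, and `Cplx`, `cplx_biquad` and `real_biquad` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical stand-in for a double-precision complex value. */
typedef struct { double re, im; } Cplx;

static Cplx c_mul(Cplx a, Cplx b)
{
    Cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}
static Cplx c_add(Cplx a, Cplx b) { Cplx r = { a.re + b.re, a.im + b.im }; return r; }
static Cplx c_sub(Cplx a, Cplx b) { Cplx r = { a.re - b.re, a.im - b.im }; return r; }

/* One complex biquad, in place. taps = { b0, b1, b2, a1, a2 }, following
   the order used in the pseudocode quoted earlier in the thread. */
void cplx_biquad(Cplx *io, size_t n, const Cplx taps[5], Cplx dly[2])
{
    for (size_t i = 0; i < n; ++i) {
        Cplx in  = io[i];
        Cplx out = c_add(dly[0], c_mul(taps[0], in));
        dly[0] = c_add(c_sub(dly[1], c_mul(taps[3], out)), c_mul(taps[1], in));
        dly[1] = c_sub(c_mul(taps[2], in), c_mul(taps[4], out));
        io[i] = out;
    }
}

/* The same structure on real data, for comparison. */
void real_biquad(double *io, size_t n, const double taps[5], double dly[2])
{
    for (size_t i = 0; i < n; ++i) {
        double in  = io[i];
        double out = dly[0] + taps[0] * in;
        dly[0] = dly[1] - taps[3] * out + taps[1] * in;
        dly[1] = taps[2] * in - taps[4] * out;
        io[i] = out;
    }
}
```

With every `taps[k].im` set to 0, the `re` outputs of `cplx_biquad` match `real_biquad` run on the left channel and the `im` outputs match it run on the right channel. The independent result comes at the price of full complex multiplies, which may be part of why the complex variant measured slower than the real one.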

Vladimir_Dudnik · 04-22-2009

There is an additional comment from our expert:

IPP provides multi-channel IIRs (especially for audio processing), but they are not intended for interleaved data:

[cpp]/* /////////////////////////////////////////////////////////////////////////////
//  Names:         ippsIIR_32f_P, ippsIIR64f_32s_P
//  Purpose:       IIR filter for multi-channel data. Vector filtering.
//  Parameters:
//      ppSrc               - pointer to array of pointers to source vectors
//      ppDst               - pointer to array of pointers to destination vectors
//      ppSrcDst            - pointer to array of source/destination vectors in in-place ops
//      len                 - length of the vectors
//      nChannels           - number of processing channels
//      ppState             - pointer to array of filter contexts
//  Return:
//      ippStsContextMatchErr  - wrong context identifier
//      ippStsNullPtrErr       - pointer(s) to the data is NULL
//      ippStsSizeErr          - length of the vectors <= 0
//      ippStsChannelErr       - number of processing channels <= 0
//      ippStsNoErr            - otherwise
//
*/
IPPAPI( IppStatus, ippsIIR_32f_P,( const Ipp32f **ppSrc, Ipp32f **ppDst, int len,
       int nChannels, IppsIIRState_32f **ppState ))
IPPAPI( IppStatus, ippsIIR_32f_IP,( Ipp32f **ppSrcDst, int len,
       int nChannels, IppsIIRState_32f **ppState ))
IPPAPI(IppStatus, ippsIIR64f_32s_PSfs, (const Ipp32s **ppSrc, Ipp32s **ppDst, int len,
       int nChannels, IppsIIRState64f_32s **ppState, int *pScaleFactor))
IPPAPI(IppStatus, ippsIIR64f_32s_IPSfs, (Ipp32s **ppSrcDst, int len,
       int nChannels, IppsIIRState64f_32s **ppState, int *pScaleFactor))
[/cpp]

Regards,
Vladimir

gol · 04-22-2009

Doh, for some reason I still have the habit of reading the IPP 5.1 docs, and I indeed see those functions in the 6.0 docs.

Weird, however, that they're only for "32f" and "64f_32s" (double float to integer?? I guess this was a user request).

Anyway, I suppose that you plan to implement the missing ones in the future.
They won't really help me, because interlacing/deinterlacing has a cost: I've measured IppsDeInterleave_32f + IppsInterleave_32f eating from 30% up to 150% of the CPU time of the filtering itself. After all, biquad filtering is kinda light in operations; it's more or less about mem/cache access. But you made the right choice, because most people are probably processing separated signal buffers. It's just that I'm stuck with interlaced stereo.

Vladimir_Dudnik · 04-23-2009

Thanks. Note that you may submit a report to Intel Premier Support on the performance issue you face with the ippsDeInterleave function.

By the way, our expert recommends using other functions instead of deinterleave:

The best choices for this purpose are CplxToReal and RealToCplx; they are highly optimized:

[cpp]

/* /////////////////////////////////////////////////////////////////////////////
//  Name:       ippsCplxToReal
//  Purpose:    form the real and imaginary parts of the input complex vector
//  Parameters:
//    pSrc       pointer to the input complex vector
//    pDstRe     pointer to output vector to store the real part
//    pDstIm     pointer to output vector to store the imaginary part
//    len        length of the vectors, number of items
//  Return:
//    ippStsNullPtrErr        pointer(s) to the data is NULL
//    ippStsSizeErr           length of the vectors is less or equal zero
//    ippStsNoErr             otherwise
*/

IPPAPI(IppStatus, ippsCplxToReal_64fc,( const Ipp64fc* pSrc, Ipp64f* pDstRe,
       Ipp64f* pDstIm, int len ))
IPPAPI(IppStatus, ippsCplxToReal_32fc,( const Ipp32fc* pSrc, Ipp32f* pDstRe,
       Ipp32f* pDstIm, int len ))
IPPAPI(IppStatus, ippsCplxToReal_16sc,( const Ipp16sc* pSrc, Ipp16s* pDstRe,
       Ipp16s* pDstIm, int len ))

/* /////////////////////////////////////////////////////////////////////////////
//  Name:       ippsRealToCplx
//  Purpose:    form complex vector from the real and imaginary components
//  Parameters:
//    pSrcRe     pointer to the input vector with real part, may be NULL
//    pSrcIm     pointer to the input vector with imaginary part, may be NULL
//    pDst       pointer to the output complex vector
//    len        length of the vectors
//  Return:
//    ippStsNullPtrErr        pointer to the destination data is NULL
//    ippStsSizeErr           length of the vectors is less or equal zero
//    ippStsNoErr             otherwise
//
//  Notes:      one of the two input pointers may be NULL. In this case
//              the corresponding values of the output complex elements is 0
*/

IPPAPI(IppStatus, ippsRealToCplx_64f,( const Ipp64f* pSrcRe,
       const Ipp64f* pSrcIm, Ipp64fc* pDst, int len ))
IPPAPI(IppStatus, ippsRealToCplx_32f,( const Ipp32f* pSrcRe,
       const Ipp32f* pSrcIm, Ipp32fc* pDst, int len ))
IPPAPI(IppStatus, ippsRealToCplx_16s,( const Ipp16s* pSrcRe,
       const Ipp16s* pSrcIm, Ipp16sc* pDst, int len ))
[/cpp]
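
Editor's sketch of the data movement those two header comments describe, written in plain C so the split/merge step around a planar filter is easy to see. This is not IPP code: `Cplx32`, `split_stereo` and `merge_stereo` are hypothetical names, the integer return codes are placeholders for the `ippSts...` values, and treating an interleaved stereo frame as one complex sample (re = left, im = right) is the assumption made throughout this thread.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 32-bit complex sample: re = left, im = right. */
typedef struct { float re, im; } Cplx32;

/* Split interleaved stereo into two planar buffers (the CplxToReal direction).
   Returns 0 on success, -1 for a NULL pointer, -2 for len <= 0. */
int split_stereo(const Cplx32 *src, float *left, float *right, int len)
{
    if (!src || !left || !right) return -1;
    if (len <= 0) return -2;
    for (int i = 0; i < len; ++i) {
        left[i]  = src[i].re;
        right[i] = src[i].im;
    }
    return 0;
}

/* Merge two planar buffers back into interleaved stereo (the RealToCplx
   direction). As in the header note, an input pointer may be NULL, in which
   case that part of every output sample is 0. Accepting both inputs as NULL
   is a choice of this sketch only. */
int merge_stereo(const float *left, const float *right, Cplx32 *dst, int len)
{
    if (!dst) return -1;
    if (len <= 0) return -2;
    for (int i = 0; i < len; ++i) {
        dst[i].re = left  ? left[i]  : 0.0f;
        dst[i].im = right ? right[i] : 0.0f;
    }
    return 0;
}
```

A planar pipeline would call the split, run a mono filter on each of `left` and `right`, then call the merge, so the buffer is traversed two extra times compared with filtering the interleaved data directly.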

Regards,
Vladimir

gol · 04-23-2009

"By the way, our expert recommend to use other functions instead of deinterleave:"

Good to know, I'll try that, thank you!