Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Debug build is 2x faster

xilin
Beginner
I have an application using IPP 5.1 to do QAM demodulation. Somehow the
debug build is over 2x faster than the release build, even after I disabled optimizations in the release build. This happens on both a dual Xeon (2.8 GHz) and a P4 3.06 GHz. I wonder what makes the difference.
The processing is basically: BandPass - Mix - LowPass - Magnitude - down-sampling.
I am using Visual Studio 2005.
9 Replies
Vladimir_Dudnik
Employee

Hello,

How do you link the IPP libraries, dynamically or statically? If you use static linking, did you call the ippStaticInit function?
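
For reference, a minimal sketch of that initialization when the static IPP libraries are linked (the call goes once at startup, before any other IPP function; this is an illustration, not code from the thread):

#include <ipp.h>

int main()
{
    // With static linking, ippStaticInit() switches dispatch from the
    // generic (PX) code path to the best CPU-specific one; without it
    // the static libraries can be noticeably slower.
    ippStaticInit();

    /* ... application code using IPP ... */
    return 0;
}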

Regards,
Vladimir

linx
Beginner
I am using dynamic linking. I will try static linking to see what happens. Thanks.

Vladimir_Dudnik
Employee
That's strange; could you share a piece of the code? What is your target platform/processor?
linx
Beginner
Sure, this is a section of the code. It is part of ultrasound imaging: frame data are broken into vectors and converted from Ipp16s to Ipp32f, then fed into this function; after this piece of code we send the data to DirectX functions for display. The strange thing is that I had this code with IPP 4.1/Visual Studio 2003 and didn't notice the problem. I just tried static linking; the debug build is still faster, though only by ~10%.
Currently we do everything in an FPGA; I am just looking at whether it is possible to move this
to software.

//#define SAMPLE_VECTOR 512
//m_nSamplesIn = 4096; nBp = nLp = 65;
// BPF, LPF are coefficients of filters (Ipp32f)

void CEnv::ProcessVector(Ipp32f* pSrc, Ipp32f* pDst)
{
// BPF -> m_pTemp
ippsConv_32f(pSrc, m_nSamplesIn, BPF, nBp, m_pTemp);

//Mix with sin/cosine -> m_pQ, m_pI
ippsMul_32f(m_sin, m_pTemp + nBp - 1, m_pI, m_nSamplesIn);
ippsMul_32f(m_cos, m_pTemp + nBp - 1, m_pQ, m_nSamplesIn);

// LPF(m_pI) -> m_pTemp, LPF(m_pQ) -> m_pI
ippsConv_32f(m_pI, m_nSamplesIn, &LPF[0], nLp, m_pTemp);
ippsConv_32f(m_pQ, m_nSamplesIn, &LPF[0], nLp, m_pI);

// decimate
int down = m_nSamplesIn / SAMPLE_VECTOR, phase = 0;
int len;

ippsSampleDown_32f(m_pTemp + nLp - 1, down * SAMPLE_VECTOR, m_pQ, &len, down, &phase);
ippsSampleDown_32f(m_pI + nLp - 1, down * SAMPLE_VECTOR, m_pTemp, &len, down, &phase);

// envelope
ippsMagnitude_32f(m_pQ, m_pTemp, pDst, SAMPLE_VECTOR);
}
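
For the Ipp16s to Ipp32f conversion step mentioned above, the standard IPP routine would presumably be used as in the sketch below (pRaw16 and pVec32f are hypothetical names, not from the posted code):

// Convert one acquired vector of 16-bit samples to float before
// calling ProcessVector.
ippsConvert_16s32f(pRaw16, pVec32f, m_nSamplesIn);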
Vladimir_Dudnik
Employee
Are your memory buffers (pSrc and pDst) aligned on a 16-byte boundary (better, 32 bytes)? You know, Intel processors can access data quite efficiently when addresses are aligned. I just don't see other reasons for that strange behaviour. To make sure the vectors are correctly aligned, I recommend you allocate them with the ippMalloc function (or the ippsMalloc_xx family of functions) and free them with ippFree (or ippsFree).
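
A minimal sketch of that allocation pattern, using names and sizes from the posted code for illustration (m_nSamplesIn = 4096, filter lengths 65; the exact sizes are an assumption):

// Allocate the working buffers with IPP's aligned allocator (32-byte
// alignment) and release them with the matching ippsFree call.
m_pTemp = ippsMalloc_32f(m_nSamplesIn + nBp - 1);   // room for the convolution tail
m_pI    = ippsMalloc_32f(m_nSamplesIn + nLp - 1);
m_pQ    = ippsMalloc_32f(m_nSamplesIn + nLp - 1);

// ... processing ...

ippsFree(m_pQ);
ippsFree(m_pI);
ippsFree(m_pTemp);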
Vladimir_Dudnik
Employee

An additional suggestion is to parallelize your processing. It seems the rows in your case are processed independently, so two rows can be done in parallel on a dual-core system. Do you use that opportunity?
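
A minimal sketch of one way to do that with OpenMP (which Visual Studio 2005 supports); nVectors, pFrameIn and pFrameOut are hypothetical names, and each thread gets its own CEnv because ProcessVector writes to member scratch buffers:

#pragma omp parallel num_threads(2)
{
    CEnv env;   // per-thread working buffers (m_pTemp, m_pI, m_pQ)

    // Vectors of a frame are independent, so the loop can be split
    // across the two cores.
    #pragma omp for
    for (int v = 0; v < nVectors; v++)
        env.ProcessVector(pFrameIn  + v * 4096,            // 4096 = samples per vector
                          pFrameOut + v * SAMPLE_VECTOR);
}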

linx
Beginner
The buffers are aligned to a page (4096 bytes). My system is single-core, though I do have two threads, each processing half of a frame.
Vladimir_Dudnik
Employee
Thanks. BTW, are your results the same between the debug and release builds, and are they correct? Could you also wrap each function call with timers to see where you spend more time than expected?
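
A minimal sketch of such per-call timing inside ProcessVector, using IPP's own cycle counter ippGetCpuClocks (any high-resolution timer would do; shown for the first two calls only):

#include <stdio.h>
#include <ipp.h>

Ipp64u t0 = ippGetCpuClocks();
ippsConv_32f(pSrc, m_nSamplesIn, BPF, nBp, m_pTemp);
Ipp64u t1 = ippGetCpuClocks();
ippsMul_32f(m_sin, m_pTemp + nBp - 1, m_pI, m_nSamplesIn);
Ipp64u t2 = ippGetCpuClocks();

printf("conv: %.0f clocks, mul: %.0f clocks\n",
       (double)(t1 - t0), (double)(t2 - t1));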
linx
Beginner
Visually both look correct and similar to the images produced by the FPGA or MATLAB, but I haven't compared every bit. I will do some profiling. Thanks.