Intel® Integrated Performance Primitives

Debug build is 2x faster

xilin — Thu, 12 Oct 2006 02:19:26 GMT

I have an application that uses IPP 5.1 to do QAM demodulation. Somehow the debug build is over 2x faster than the release build, even after I disabled optimizations in the release build. This happens on both a dual Xeon (2.8 GHz) and a P4 (3.06 GHz). I wonder what makes the difference.
The processing is basically: band-pass - mix - low-pass - magnitude - downsampling.
I am using Visual Studio 2005.

Re: Debug build is 2x faster

Vladimir_Dudnik — Thu, 12 Oct 2006 21:53:24 GMT

Hello,

How do you link the IPP libraries, dynamically or statically? If you use static linking, did you call the ippStaticInit function?
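Some background on why that question matters, as we understand IPP 5.x: with static linking, the processor-specific code path is selected when ippStaticInit() is called, and without that call the library may fall back to generic code that is considerably slower on a Pentium 4 or Xeon. The sketch below only illustrates that dispatch pattern. Every name in it (libInit, libSum, sumGeneric, sumOptimized) is hypothetical, and it is not how IPP is implemented internally.

```cpp
#include <cstddef>

// Illustration only: hypothetical names, not IPP's internal mechanism.
// Calls go through a function pointer that starts at a generic kernel and is
// switched to a CPU-specific kernel only when the init function is called.
using SumFn = float (*)(const float*, std::size_t);

// Generic (lowest-common-denominator) kernel.
static float sumGeneric(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Stand-in for a processor-specific kernel; here it merely uses four
// independent accumulators so the CPU can overlap the additions.
static float sumOptimized(const float* x, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i) s0 += x[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}

// Until libInit() runs, every call uses the generic kernel.
static SumFn g_sum = sumGeneric;

// Hypothetical analogue of ippStaticInit(): select the kernel once, up front.
// 'cpuHasFastPath' stands in for real CPU feature detection (e.g. CPUID).
void libInit(bool cpuHasFastPath) {
    g_sum = cpuHasFastPath ? sumOptimized : sumGeneric;
}

bool usingOptimizedPath() { return g_sum == sumOptimized; }

float libSum(const float* x, std::size_t n) { return g_sum(x, n); }
```

Both kernels return the same result; only the code path differs, which is why a missing init call shows up as a speed problem rather than a correctness problem.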

Regards,
Vladimir

linx — Fri, 13 Oct 2006 01:38:27 GMT

I am using dynamic linking. I will try static linking to see what happens. Thanks.

Vladimir_Dudnik — Fri, 13 Oct 2006 01:50:14 GMT

That's strange. Could you share a piece of the code? What is your target platform/processor?

linx — Fri, 13 Oct 2006 02:14:41 GMT

Sure, this is a section of the code. It is part of an ultrasound imaging system: frame data are broken into vectors, converted from Ipp16s to Ipp32f, and then fed into this function; after this piece of code, we send the data to DirectX functions for display. The strange thing is that I had this code with IPP 4.1/Visual Studio 2003 and didn't notice the problem. I just tried static linking: the debug build is still faster, though only by ~10%.
Currently we are doing everything in an FPGA; I am just looking into whether it is possible to move this to software.
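The conversion step described above (frame split into vectors, Ipp16s to Ipp32f) can be sketched in plain C++. This assumes a row-major frame of numVectors vectors with samplesPerVector samples each; the layout and the function name frameToFloat are hypothetical. In IPP the per-vector loop body would correspond to one ippsConvert_16s32f call.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts one frame of int16 samples to float, vector by vector.
// Assumed layout (hypothetical): row-major, 'numVectors' vectors of
// 'samplesPerVector' contiguous samples each.
std::vector<float> frameToFloat(const std::int16_t* frame, std::size_t numVectors,
                                std::size_t samplesPerVector) {
    std::vector<float> out(numVectors * samplesPerVector);
    for (std::size_t v = 0; v < numVectors; ++v) {
        const std::int16_t* in = frame + v * samplesPerVector;
        float* dst = out.data() + v * samplesPerVector;
        // Every int16 value is exactly representable as a float.
        for (std::size_t s = 0; s < samplesPerVector; ++s)
            dst[s] = static_cast<float>(in[s]);
    }
    return out;
}
```

Each converted vector (out.data() + v * samplesPerVector) is then what would be passed as pSrc to ProcessVector.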

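// Defined elsewhere in the project:
//   #define SAMPLE_VECTOR 512                  // output samples per vector
//   m_nSamplesIn = 4096; nBp = nLp = 65;     // input length, filter tap counts
// BPF, LPF are the filter coefficients (Ipp32f).
// ippsConv_32f writes srcLen + kernelLen - 1 samples, so m_pTemp and m_pI
// must each hold at least m_nSamplesIn + max(nBp, nLp) - 1 floats.

void CEnv::ProcessVector(Ipp32f* pSrc, Ipp32f* pDst)
{
    // Band-pass filter (full linear convolution) -> m_pTemp
    ippsConv_32f(pSrc, m_nSamplesIn, BPF, nBp, m_pTemp);

    // Mix with sine/cosine, skipping the first nBp - 1 convolution samples -> m_pI, m_pQ
    ippsMul_32f(m_sin, m_pTemp + nBp - 1, m_pI, m_nSamplesIn);
    ippsMul_32f(m_cos, m_pTemp + nBp - 1, m_pQ, m_nSamplesIn);

    // Low-pass filter: LPF(m_pI) -> m_pTemp, LPF(m_pQ) -> m_pI
    ippsConv_32f(m_pI, m_nSamplesIn, &LPF[0], nLp, m_pTemp);
    ippsConv_32f(m_pQ, m_nSamplesIn, &LPF[0], nLp, m_pI);

    // Decimate by 'down' to SAMPLE_VECTOR samples per channel
    int down = m_nSamplesIn / SAMPLE_VECTOR, phase = 0;
    int len;

    ippsSampleDown_32f(m_pTemp + nLp - 1, down * SAMPLE_VECTOR, m_pQ, &len, down, &phase);
    ippsSampleDown_32f(m_pI + nLp - 1, down * SAMPLE_VECTOR, m_pTemp, &len, down, &phase);

    // Envelope: sqrt(I^2 + Q^2) -> pDst
    ippsMagnitude_32f(m_pQ, m_pTemp, pDst, SAMPLE_VECTOR);
}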
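To check the IPP output bit by bit, or to time a baseline that does not depend on IPP's code paths, a plain C++ reference of the same chain can help. The sketch below mirrors the indexing of ProcessVector (skip the first K - 1 samples of each full convolution, keep every down-th sample starting at phase 0), but it is an assumption-laden illustration: the function names are hypothetical and the direct O(n*K) convolution is not how IPP computes it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Full linear convolution; output length x.size() + h.size() - 1,
// the same length ippsConv_32f produces.
std::vector<float> convFull(const std::vector<float>& x, const std::vector<float>& h) {
    assert(!x.empty() && !h.empty());
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < h.size(); ++j)
            y[i + j] += x[i] * h[j];
    return y;
}

// Hypothetical reference for the envelope chain in ProcessVector:
// BPF -> mix with sin/cos -> LPF -> keep every 'down'-th sample -> magnitude.
std::vector<float> envelopeReference(const std::vector<float>& src,
                                     const std::vector<float>& bpf,
                                     const std::vector<float>& lpf,
                                     const std::vector<float>& sinTab,
                                     const std::vector<float>& cosTab,
                                     std::size_t outLen) {
    const std::size_t n = src.size();
    assert(!bpf.empty() && !lpf.empty());
    assert(outLen > 0 && outLen <= n);
    assert(sinTab.size() >= n && cosTab.size() >= n);
    const std::size_t down = n / outLen;

    // Band-pass, then mix; skip the first bpf.size() - 1 output samples.
    const std::vector<float> bp = convFull(src, bpf);
    std::vector<float> iCh(n), qCh(n);
    for (std::size_t k = 0; k < n; ++k) {
        iCh[k] = sinTab[k] * bp[k + bpf.size() - 1];
        qCh[k] = cosTab[k] * bp[k + bpf.size() - 1];
    }

    // Low-pass both channels.
    const std::vector<float> iLp = convFull(iCh, lpf);
    const std::vector<float> qLp = convFull(qCh, lpf);

    // Decimate (phase 0) and take the magnitude.
    std::vector<float> env(outLen);
    for (std::size_t m = 0; m < outLen; ++m) {
        const std::size_t idx = lpf.size() - 1 + m * down;
        env[m] = std::sqrt(iLp[idx] * iLp[idx] + qLp[idx] * qLp[idx]);
    }
    return env;
}
```

Since the output is only read at every down-th position, the low-pass stage could also be evaluated only at those positions; the reference keeps the full convolution to match the posted code step for step.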

Vladimir_Dudnik — Fri, 13 Oct 2006 02:28:49 GMT

Are your memory buffers (pSrc and pDst) aligned on a 16-byte boundary (32 bytes is better)? Intel processors access data more efficiently at aligned addresses. I don't see any other reason for this strange behaviour. To make sure the vectors are correctly aligned, I recommend allocating them with the ippMalloc function (or the ippsMalloc_xx family of functions) and freeing them with ippFree (ippsFree for memory from ippsMalloc_xx).
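A quick way to verify alignment is to test the pointer value directly. The sketch below does that, plus a portable aligned allocation using C++17 aligned operator new as a stand-in for ippsMalloc_32f/ippsFree when IPP is not at hand; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// True if 'p' lies on an 'alignment'-byte boundary (alignment: power of two).
bool isAligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical stand-ins for ippsMalloc_32f / ippsFree using C++17 aligned
// operator new; memory from allocAlignedFloats must be released with
// freeAlignedFloats using the same alignment.
float* allocAlignedFloats(std::size_t count, std::size_t alignment = 32) {
    return static_cast<float*>(
        ::operator new(count * sizeof(float), std::align_val_t(alignment)));
}

void freeAlignedFloats(float* p, std::size_t alignment = 32) {
    ::operator delete(p, std::align_val_t(alignment));
}
```

Note that the interior pointers matter too. With nBp = nLp = 65, m_pTemp + nBp - 1 is 64 floats (256 bytes) into the buffer, so it keeps the base pointer's 32-byte alignment; a kernel length such as 64 would instead leave the offset 63 floats in, which is only 4-byte aligned.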

Vladimir_Dudnik — Fri, 13 Oct 2006 02:49:25 GMT

An additional suggestion is to parallelize your processing. It seems the rows in your case are processed independently, so two rows can be processed in parallel on dual-core systems. Do you take advantage of that?
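Splitting independent rows across threads can be sketched as below (C++11 std::thread; on Visual Studio 2005 the Win32 equivalent would be _beginthreadex or CreateThread). One caution drawn from the posted code: ProcessVector works in the member buffers m_pTemp, m_pI and m_pQ, so two threads calling it on the same CEnv object would overwrite each other's intermediate results; each thread needs its own CEnv instance or its own scratch buffers. The function name and parameters here are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Splits 'rows' independent rows into contiguous blocks, one block per thread.
// Row r is read from src + r*srcCols and written to dst + r*dstCols.
// 'rowFn' stands in for a per-row routine such as ProcessVector; any scratch
// buffers it uses must be private to the calling thread.
void processRowsParallel(const float* src, std::size_t srcCols,
                         float* dst, std::size_t dstCols,
                         std::size_t rows, unsigned numThreads,
                         const std::function<void(const float*, float*)>& rowFn) {
    if (numThreads == 0) numThreads = 1;
    const std::size_t perThread = (rows + numThreads - 1) / numThreads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = t * perThread;
        const std::size_t end = std::min(rows, begin + perThread);
        if (begin >= end) break;  // fewer rows than threads
        workers.emplace_back([=, &rowFn] {
            for (std::size_t r = begin; r < end; ++r)
                rowFn(src + r * srcCols, dst + r * dstCols);
        });
    }
    for (std::thread& w : workers) w.join();  // wait for every block
}
```

On a single-core machine this gives no speedup for compute-bound work; it only pays off when there are at least as many cores as threads.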

linx — Fri, 13 Oct 2006 04:38:12 GMT

The buffers are page-aligned (4096 bytes). My system is single-core, though I do have two threads, each processing half of a frame.

Vladimir_Dudnik — Fri, 13 Oct 2006 05:00:00 GMT

Thanks. By the way, are your results the same in the debug and release builds, and are they correct? Could you also wrap each function call with timers to see where you spend more time than expected?
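Wrapping each call with a timer can be done with a small accumulator keyed by stage name. This sketch uses C++11 std::chrono::steady_clock, which Visual Studio 2005 does not have; there, QueryPerformanceCounter is the usual high-resolution alternative. The class and stage names are hypothetical.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulates wall-clock time and call counts per named stage.
class StageTimer {
public:
    // Runs f() and charges its elapsed time to 'stage'.
    template <typename F>
    void time(const std::string& stage, F&& f) {
        const auto t0 = std::chrono::steady_clock::now();
        std::forward<F>(f)();
        const auto t1 = std::chrono::steady_clock::now();
        totalMicros_[stage] +=
            std::chrono::duration<double, std::micro>(t1 - t0).count();
        ++calls_[stage];
    }

    // Total microseconds spent in 'stage' (0 if never timed).
    double totalMicros(const std::string& stage) const {
        auto it = totalMicros_.find(stage);
        return it == totalMicros_.end() ? 0.0 : it->second;
    }

    // Number of timed calls for 'stage' (0 if never timed).
    long calls(const std::string& stage) const {
        auto it = calls_.find(stage);
        return it == calls_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, double> totalMicros_;
    std::map<std::string, long> calls_;
};
```

Inside ProcessVector a stage would be wrapped as `timer.time("BPF", [&] { ippsConv_32f(pSrc, m_nSamplesIn, BPF, nBp, m_pTemp); });`. A single call on 4096 samples may take only microseconds, so totals accumulated over a whole frame (or many frames) give more stable numbers than individual readings.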

linx — Sat, 14 Oct 2006 01:08:42 GMT

Visually, both look correct and similar to the images produced by the FPGA or by MATLAB, but I haven't compared them bit by bit. I will do some profiling. Thanks.
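A numeric comparison of the two builds' output vectors is more reliable than a visual check. Floating-point results from a debug and a release build need not be bit-identical (the compiler may evaluate expressions differently at different optimization levels), so this sketch reports the largest absolute difference and counts samples beyond a tolerance instead of demanding exact equality. The struct and function names, and the choice of tolerance, are assumptions.

```cpp
#include <cmath>
#include <cstddef>

// Summary of how far two output vectors are apart.
struct DiffStats {
    double maxAbs;           // largest |a[i] - b[i]|
    std::size_t worstIndex;  // index where maxAbs occurs
    std::size_t mismatches;  // count of samples with |a[i] - b[i]| > tol
};

// Compares two float vectors, e.g. debug-build vs release-build envelopes.
DiffStats compareOutputs(const float* a, const float* b, std::size_t n, double tol) {
    DiffStats s{0.0, 0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        const double d = std::fabs(static_cast<double>(a[i]) - static_cast<double>(b[i]));
        if (d > s.maxAbs) {
            s.maxAbs = d;
            s.worstIndex = i;
        }
        if (d > tol) ++s.mismatches;
    }
    return s;
}
```

A mismatch count of zero at a tolerance well below the display resolution would indicate that the speed difference is not caused by the two builds computing different things.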