Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Debug build is 2x faster

xilin
Beginner
817 Views
I have an application using IPP 5.1 to do QAM demodulation. Somehow the debug build is over 2x faster than the release build, even after I disabled optimizations in the release build. This happened on both a dual Xeon (2.8 GHz) and a P4 3.06 GHz. I wonder what makes the difference.
The processing is basically: BandPass - Mix - LowPass - Magnitude - downsampling.
I am using Visual Studio 2005.
9 Replies
Vladimir_Dudnik
Employee

Hello,

How do you link the IPP libs, dynamically or statically? If you use static linking, did you call the ippStaticInit function?

Regards,
Vladimir

linx
Beginner
I am using dynamic linking. I will try static linking to see what happens. Thanks.

Vladimir_Dudnik
Employee
That is strange. Could you share a piece of the code? What is your target platform/processor?
linx
Beginner
Sure, here is a section of the code. This is part of an ultrasound imaging pipeline: frame data are broken into vectors and converted from Ipp16s to Ipp32f, then fed into this function. After this piece of code, we send the data to DirectX functions for display. The strange thing is that I had the same code with IPP 4.1/Visual Studio 2003 and didn't notice the problem. I just tried static linking; the debug build is still faster, though only by ~10%.
Currently we are doing everything in an FPGA; I am just looking at whether it is possible to move this to SW.

//#define SAMPLE_VECTOR 512
//m_nSamplesIn = 4096; nBp = nLp = 65;
// BPF, LPF are coefficients of filters (Ipp32f)

void CEnv::ProcessVector(Ipp32f* pSrc, Ipp32f* pDst)
{
    // Band-pass filter: pSrc -> m_pTemp
    ippsConv_32f(pSrc, m_nSamplesIn, BPF, nBp, m_pTemp);

    // Mix with sine/cosine -> m_pI, m_pQ
    ippsMul_32f(m_sin, m_pTemp + nBp - 1, m_pI, m_nSamplesIn);
    ippsMul_32f(m_cos, m_pTemp + nBp - 1, m_pQ, m_nSamplesIn);

    // Low-pass filter: LPF(m_pI) -> m_pTemp, LPF(m_pQ) -> m_pI
    ippsConv_32f(m_pI, m_nSamplesIn, &LPF[0], nLp, m_pTemp);
    ippsConv_32f(m_pQ, m_nSamplesIn, &LPF[0], nLp, m_pI);

    // Decimate by a factor of `down`
    int down = m_nSamplesIn / SAMPLE_VECTOR, phase = 0;
    int len;

    ippsSampleDown_32f(m_pTemp + nLp - 1, down * SAMPLE_VECTOR, m_pQ, &len, down, &phase);
    ippsSampleDown_32f(m_pI + nLp - 1, down * SAMPLE_VECTOR, m_pTemp, &len, down, &phase);

    // Envelope: magnitude of the decimated pair -> pDst
    ippsMagnitude_32f(m_pQ, m_pTemp, pDst, SAMPLE_VECTOR);
}
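A note for readers on buffer sizes: ippsConv_32f computes a full linear convolution, so its destination receives srcLen + kernelLen - 1 samples (m_nSamplesIn + nBp - 1 for the band-pass call above). The plain C++ reference below is only a sketch for checking lengths and values on small inputs; conv_full is a hypothetical helper, not the IPP implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference full linear convolution: y[n] = sum over k of x[k] * h[n - k].
// The result has x.size() + h.size() - 1 samples, the same length rule
// that applies to the destination buffer of a full convolution.
std::vector<float> conv_full(const std::vector<float>& x, const std::vector<float>& h)
{
    assert(!x.empty() && !h.empty());
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = 0; j < h.size(); ++j) {
            y[i + j] += x[i] * h[j];
        }
    }
    return y;
}
```

With the sizes quoted above (4096 samples, 65 taps), each scratch buffer that receives a convolution result therefore needs room for at least 4160 floats.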
Vladimir_Dudnik
Employee
Are your memory buffers (pSrc and pDst) aligned on a 16-byte boundary (or better, 32 bytes)? Intel processors access data most efficiently at aligned addresses. I don't see any other reason for this strange behaviour. To make sure the vectors are correctly aligned, I recommend allocating them with the ippMalloc function (the ippsMalloc_xx family of functions) and freeing them with the ippFree function.
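A quick way to verify alignment at run time is to test the pointer value itself. This is a minimal sketch in standard C++; is_aligned is a hypothetical helper, not an IPP function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when `p` lies on a multiple of `boundary` bytes.
// `boundary` must be a power of two (for example 16 or 32).
inline bool is_aligned(const void* p, std::size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (boundary - 1)) == 0;
}
```

It is worth checking the offset pointers as well as the base pointers. In the posted code the intermediate reads start at m_pTemp + nBp - 1; with nBp = 65 that is an offset of 64 floats (256 bytes), so those reads stay 32-byte aligned whenever m_pTemp itself is.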
Vladimir_Dudnik
Employee

An additional suggestion is to parallelize your processing. It seems that rows in your case are processed independently, so two rows can be processed in parallel on dual-core systems. Do you use that opportunity?
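A minimal sketch of that idea, assuming C++11 std::thread is available (the original project may well use Win32 threads instead). process_rows_two_threads and RowFn are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-row worker: any callable that takes the row index.
using RowFn = std::function<void(std::size_t)>;

// Splits rows [0, rowCount) into two contiguous halves. The first half runs
// on a new thread, the second half on the calling thread.
void process_rows_two_threads(std::size_t rowCount, const RowFn& processRow)
{
    const std::size_t mid = rowCount / 2;
    auto run = [&processRow](std::size_t first, std::size_t last) {
        for (std::size_t r = first; r < last; ++r) {
            processRow(r);
        }
    };
    std::thread firstHalf(run, std::size_t{0}, mid);
    run(mid, rowCount);
    firstHalf.join();  // wait until both halves are done
}
```

Since the posted ProcessVector writes to member scratch buffers (m_pTemp, m_pI, m_pQ), each thread would need its own CEnv instance, or its own set of scratch buffers, to avoid the two halves overwriting each other's intermediate data.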

linx
Beginner
The buffers are aligned to a page boundary (4096 bytes). My system is single core, though I do have two threads, each processing half of a frame.
Vladimir_Dudnik
Employee
Thanks. BTW, are your results the same between the debug and release builds, and are they correct? Could you also wrap each function call with timers to see where you spend more time than expected?
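One way to wrap a call with a timer, sketched here with std::chrono from standard C++ (time_us is a hypothetical helper; a Win32 timer such as QueryPerformanceCounter would serve the same purpose):

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Runs `fn` once and returns the elapsed wall-clock time in microseconds.
template <typename Fn>
double time_us(Fn&& fn)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Inside ProcessVector this could be used as `tBpf += time_us([&] { ippsConv_32f(pSrc, m_nSamplesIn, BPF, nBp, m_pTemp); });`, with tBpf being an assumed accumulator variable. Summing over many vectors in both builds gives more stable numbers than timing a single call.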
linx
Beginner
Visually both look correct and similar to images produced by the FPGA or Matlab, but I haven't compared every bit. I will do some profiling. Thanks.
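For a comparison that goes beyond visual inspection, the output vectors of the two builds could be dumped and compared sample by sample. A minimal sketch, where DiffReport and compare_buffers are hypothetical names:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>

struct DiffReport {
    std::size_t mismatches;  // samples whose bit patterns differ
    float maxAbsDiff;        // largest |a[i] - b[i]| seen (NaN differences are not tracked)
};

// Compares two float buffers of length n, bit by bit and by magnitude.
DiffReport compare_buffers(const float* a, const float* b, std::size_t n)
{
    DiffReport report{0, 0.0f};
    for (std::size_t i = 0; i < n; ++i) {
        if (std::memcmp(&a[i], &b[i], sizeof(float)) != 0) {
            ++report.mismatches;
        }
        const float d = std::fabs(a[i] - b[i]);
        if (d > report.maxAbsDiff) {
            report.maxAbsDiff = d;
        }
    }
    return report;
}
```

A nonzero mismatch count with a tiny maxAbsDiff would point to rounding differences between code paths, while large differences would suggest the two builds are not doing the same work.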