I have an application using IPP 5.1 to do QAM demodulation. Somehow the
debug build is over 2x faster than the release build, even after I disabled optimizations in the release build. This happened on both a dual Xeon (2.8 GHz) and a P4 3.06 GHz. I wonder what makes the difference.
The processing is basically: BandPass - Mix - LowPass - Magnitude - down sampling.
I am using Visual Studio 2005.
9 Replies
Hello,
how do you link the IPP libs, dynamically or statically? If you use static linking, did you call the ippStaticInit function?
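With static linking, the dispatcher has to be initialized once at program startup, before any other IPP call, so that the CPU-specific code path is selected. Roughly like this (a minimal sketch, assuming IPP 5.x and the usual ipp.h header):
#include <stdio.h>
#include "ipp.h"
int main()
{
    // With static linking, ippStaticInit() selects the code path
    // optimized for the CPU the program is running on.
    IppStatus st = ippStaticInit();
    printf("ippStaticInit: %s\n", ippGetStatusString(st));
    // Print which library/optimization was actually dispatched.
    const IppLibraryVersion* v = ippsGetLibVersion();
    printf("Using %s %s\n", v->Name, v->Version);
    return 0;
}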
Regards,
Vladimir
I am using dynamic linking. I will try static linking to see what happens. Thanks.
That's strange. Could you share a piece of the code? What is your target platform/processor?
Sure, this is a section of the code. It is part of an ultrasound imaging pipeline: frame data are broken into vectors, converted from Ipp16s to Ipp32f, and fed into this function; after this piece of code we send the data to DirectX functions for display. The strange thing is that I had this code running with IPP 4.1/Visual Studio 2003 and didn't notice the problem. I just tried static linking; the debug build is still faster, though only by ~10%.
Currently we do everything in an FPGA; I am just looking into whether it is possible to move this to software.
// #define SAMPLE_VECTOR 512
// m_nSamplesIn = 4096; nBp = nLp = 65;
// BPF, LPF are the filter coefficients (Ipp32f)
void CEnv::ProcessVector(Ipp32f* pSrc, Ipp32f* pDst)
{
    // Band-pass filter: pSrc -> m_pTemp
    ippsConv_32f(pSrc, m_nSamplesIn, BPF, nBp, m_pTemp);
    // Mix with sine/cosine -> m_pI, m_pQ
    ippsMul_32f(m_sin, m_pTemp + nBp - 1, m_pI, m_nSamplesIn);
    ippsMul_32f(m_cos, m_pTemp + nBp - 1, m_pQ, m_nSamplesIn);
    // Low-pass filter: LPF(m_pI) -> m_pTemp, LPF(m_pQ) -> m_pI
    ippsConv_32f(m_pI, m_nSamplesIn, &LPF[0], nLp, m_pTemp);
    ippsConv_32f(m_pQ, m_nSamplesIn, &LPF[0], nLp, m_pI);
    // Decimate down to SAMPLE_VECTOR samples
    int down = m_nSamplesIn / SAMPLE_VECTOR, phase = 0;
    int len;
    ippsSampleDown_32f(m_pTemp + nLp - 1, down * SAMPLE_VECTOR, m_pQ, &len, down, &phase);
    ippsSampleDown_32f(m_pI + nLp - 1, down * SAMPLE_VECTOR, m_pTemp, &len, down, &phase);
    // Envelope: magnitude of the I/Q pair
    ippsMagnitude_32f(m_pQ, m_pTemp, pDst, SAMPLE_VECTOR);
}
Are your memory buffers (pSrc and pDst) aligned on a 16-byte boundary (better yet, 32 bytes)? Intel processors can access data much more efficiently when the addresses are aligned, and I just don't see any other reason for that strange behaviour. To make sure the vectors are correctly aligned, I recommend you allocate them with ippMalloc (or the ippsMalloc_xx family of functions) and free them with ippFree.
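For example, something along these lines (a minimal sketch; the helper names and the buffer lengths of 4096/512 are just illustrative, taken from the values mentioned earlier in the thread):
#include "ipp.h"
void AllocateBuffers(Ipp32f** ppSrc, Ipp32f** ppDst)
{
    // ippsMalloc_32f returns memory aligned on a 32-byte boundary,
    // so the SIMD code paths inside IPP can use aligned loads/stores.
    *ppSrc = ippsMalloc_32f(4096);  // input vector (m_nSamplesIn)
    *ppDst = ippsMalloc_32f(512);   // output vector (SAMPLE_VECTOR)
}
void FreeBuffers(Ipp32f* pSrc, Ipp32f* pDst)
{
    // Buffers obtained from ippsMalloc_xx must be released with ippsFree.
    ippsFree(pSrc);
    ippsFree(pDst);
}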
An additional suggestion is to parallelize your processing. It seems the rows (vectors) in your case are processed independently, so two rows could be processed in parallel on dual-core systems. Do you use that opportunity?
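Roughly like this (a rough sketch only, assuming the vectors of a frame are independent; since ProcessVector uses member scratch buffers, each thread would need its own CEnv instance, and the names ProcessFrame/vectorsPerFrame are just illustrative):
#include <omp.h>
// One CEnv instance per thread so the scratch buffers
// (m_pTemp, m_pI, m_pQ) are not shared between threads.
void ProcessFrame(CEnv* envPerThread, Ipp32f** vecIn, Ipp32f** vecOut, int vectorsPerFrame)
{
    #pragma omp parallel for
    for (int v = 0; v < vectorsPerFrame; ++v)
    {
        // omp_get_thread_num() selects this thread's private CEnv.
        envPerThread[omp_get_thread_num()].ProcessVector(vecIn[v], vecOut[v]);
    }
}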
The buffers are aligned to a page (4096 bytes). My system is single-core, though I do have two threads, each processing half of a frame.
Thanks. BTW, are your results the same between the debug and release builds, and are they correct? Could you also wrap each function call with timers, to see where you spend more time than expected?
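For instance, something like this around each call (a minimal sketch using ippGetCpuClocks from ippcore; the wrapper function is only illustrative):
#include <stdio.h>
#include "ipp.h"
// Read the CPU timestamp counter before and after one IPP call
// and print the elapsed clocks for that call.
void TimedConv(const Ipp32f* pSrc, int srcLen, const Ipp32f* pTaps, int nTaps, Ipp32f* pDst)
{
    Ipp64u t0 = ippGetCpuClocks();
    ippsConv_32f(pSrc, srcLen, pTaps, nTaps, pDst);
    Ipp64u t1 = ippGetCpuClocks();
    printf("ippsConv_32f: %.0f clocks\n", (double)(t1 - t0));
}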
Visually, both look correct and similar to the images produced by the FPGA or MATLAB, but I haven't compared every bit. I will do some profiling. Thanks.