Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

IPP 5.1 on Centrino

ctapang
Beginner
471 Views

I am using FFT routines, and I notice that these run much slower on my laptop (Centrino T7500) than on my hyperthreading desktop. Is this expected? If so, why? Thanks in advance.

--Carlos

0 Kudos
10 Replies
Vladimir_Dudnik
Employee

Hi Carlos,

What processor is installed on your hyperthreading desktop?

And please note: if you link your application with the IPP static libraries, you need to call the ippStaticInit() function somewhere at the beginning of your application to initialize the IPP static dispatcher.

You can check which IPP processor-specific library is loaded by default on your system: run the ippiDemo executable and choose Help->About from the application menu. On a T7500 it should show that the T7 family of libraries is loaded.

Of course, you might also be interested in upgrading to the latest version of IPP, currently IPP 5.3, which includes V8 processor-specific libraries targeted specifically at the Intel Core2 architecture.

Regards,
Vladimir

ctapang
Beginner

Thank you, Vladimir.

My desktop processor is a 3-year-old Pentium 4, running at 3 GHz with an 800 MHz front-side bus. It supports hyperthreading and has MMX and SSE/SSE2/SSE3.

I tried IPP Version 5.3, and it's still slow on the Centrino, about 4 times slower than 5.1 running on the old Pentium 4. When I run the demo ps_ippac.exe, it reports that my Centrino is a Core 2 Duo processor 2x2200 MHz, L1=32/32K.

I ran the Processor Identification Utility on the Centrino, and it reports that it has SSE/SSE2/SSE3, but no MMX. So maybe not having MMX is a huge factor? If so, then I just bought myself a castrated CPU.

Vladimir_Dudnik
Employee

Actually, the availability of MMX does not affect performance on the newest processors, as in IPP we mostly use the SSEx instruction sets.

Remember that if you link your application with the IPP static libraries, you need to call the ippStaticInit() function at the beginning of your program; otherwise the generic PX code (which does not use MMX or SSEx instructions) will be dispatched in IPP.

Regards,
Vladimir

ctapang
Beginner

Hi Vladimir,

I am not linking with the static libraries. I am letting IPP decide which dynamic library to use. GetCPUType returns ippCPUC2D on version 5.3.1.064. Is there a way to find out exactly which dynamic libraries are being loaded? I appreciate your help.

Thanks.

--Carlos

Vladimir_Dudnik
Employee

If ippGetCpuType returns the correct value, then most probably everything is OK with the dispatching of IPP processor-specific code. You can confirm that by calling ippGetLibVersion() (or a domain-specific variant such as ippjGetLibVersion()) and printing out the values in the IppLibraryVersion structure. Code like this does the trick:

const IppLibraryVersion* v = ippjGetLibVersion();
printf("Intel Integrated Performance Primitives\n");
printf(" version: %s, [%d.%d.%d.%d]\n",
       v->Version, v->major, v->minor, v->majorBuild, v->build);
printf(" name: %s\n", v->Name);
printf(" date: %s\n", v->BuildDate);

The library name contains a two-letter code that identifies the processor-specific code. For example, for the Intel Core2 architecture it will be V8 on a 32-bit OS (and U8 on a 64-bit OS).

Could you please also share your test case? There might be some issue with the range of the input data or something like that.

Vladimir

ctapang
Beginner

Here's what I get from ippjGetLibVersion:

IPP: version: 5.3 Update 1 build 85.18, [5.3.85.477], name: ippjv8-5.3.dll, date: Nov 17 2007

And so the correct DLL is being loaded.

Let me try to attach a couple of files to this message.

ctapang
Beginner

Here is part of the code I am running. Please note again that this same code runs fast on the Pentium 4, but very slowly on the Centrino.

#include "ipp.h"
#include "NearestNeighbors.h"
#if MODE!=2
#include "LocalMaxima.h"
#endif
#include "StdAfx.h"
#include "MotionDetector.h"

#ifdef TRACEVECTORS
// TraceVectors is not part of the VS project so that it does not get linked in if _DEBUG is OFF.
#include "TraceVectors.cpp"
#endif

/// MotionDetector class
MotionDetector::MotionDetector(void)
{
    InitializeCriticalSection( &m_cs );

    // FFT dimensions are powers of two given by the FFT orders
    fftSize.width = 1 << orderX;
    fftSize.height = 1 << orderY;
    halfW = fftSize.width / 2;
    halfH = fftSize.height / 2;

    m_pYBuf = NULL;
    m_bEnabled = true;
    m_bDrawImage = false;

    ippiFFTInitAlloc_R_32f( &pFFTspec, orderX, orderY, IPP_FFT_DIV_BY_SQRTN, ippAlgHintAccurate );
    ippiFFTInitAlloc_C_32fc( &pCFFTSpec, orderX, orderY, IPP_FFT_DIV_BY_SQRTN, ippAlgHintAccurate );

    m_pMagnitudes = AllocateFloatBuf(&fftSize);
    m_pYBufT1 = AllocateFloatBuf(&fftSize);
    m_pYBufT2 = AllocateFloatBuf(&fftSize);
    m_pPhaseBuf = AllocateFloatBuf(&fftSize);

    int sz = fftSize.width * fftSize.height * sizeof(Ipp32fc);
    m_pPhaseDiffMap = (Ipp32fc*)ippMalloc( sz );
    m_pComplexNumberImage = (Ipp32fc*)ippMalloc( sz );

    // fill the magnitude buffer with a constant value
    int x, y;
    for (y = 0; y < fftSize.height; y++)
    {
        Ipp32f* pRow = m_pMagnitudes + y * fftSize.width;
        for (x = 0; x < fftSize.width; x++)
        {
            pRow[x] = 66.0;
        }
    }
}

MotionDetector::~MotionDetector(void)
{
    EnterCriticalSection(&m_cs);

    if (m_pYBufT1 != NULL)
    {
        ippFree(m_pYBufT1);
        m_pYBufT1 = NULL;
    }
    if (m_pYBufT2 != NULL)
    {
        ippFree(m_pYBufT2);
        m_pYBufT2 = NULL;
    }
    if (m_pPhaseBuf != NULL)
    {
        ippFree(m_pPhaseBuf);
        m_pPhaseBuf = NULL;
    }
    if (m_pMagnitudes != NULL)
    {
        ippFree(m_pMagnitudes);
        m_pMagnitudes = NULL;
    }
    if (m_pPhaseDiffMap != NULL)
    {
        ippFree(m_pPhaseDiffMap);
        m_pPhaseDiffMap = NULL;
    }
    if (m_pComplexNumberImage != NULL)
    {
        ippFree(m_pComplexNumberImage);
        m_pComplexNumberImage = NULL;
    }

    ippiFFTFree_C_32fc( pCFFTSpec );
    ippiFFTFree_R_32f( pFFTspec );

    LeaveCriticalSection(&m_cs);
    DeleteCriticalSection(&m_cs);
}

void MotionDetector::InitMotionDetector(int x, int y, Ipp8u* pYBuf, VIDEOINFOHEADER* pVIH)
{
    m_x = x;
    m_y = y;
    m_pYBuf = pYBuf;
    m_HStride = pVIH->bmiHeader.biWidth;
    m_VStride = pVIH->bmiHeader.biHeight;
}

void MotionDetector::ApplyWindow(int x, int y)
{
    int fftStride = fftSize.width * sizeof(Ipp32f);
    Ipp8u* pSrc = m_pYBuf + x + y * m_HStride;

    if (ippStsNoErr != ippiConvert_8u32f_C1R(pSrc, m_HStride, m_pYBufT1, fftStride, fftSize))
        throw 5;
    if (ippStsNoErr != ippiWinBartlett_32f_C1IR(m_pYBufT1, fftStride, fftSize))
        throw 4;
}

Ipp32f* MotionDetector::AllocateFloatBuf(IppiSize* pRoi)
{
    int sz = pRoi->width * pRoi->height;
    Ipp32f* pFTBuf = (Ipp32f*)ippMalloc(sz * sizeof(Ipp32f));
    //ippsZero_32s(m_pFTBuf,sz);
    if (pFTBuf == NULL)
        throw 7;
    return pFTBuf;
}

void MotionDetector::ApplyFFT()
{
    if (ippStsNoErr != ippiFFTFwd_RToPack_32f_C1R( m_pYBufT1, fftSize.width*sizeof(Ipp32f), m_pYBufT2, fftSize.width*sizeof(Ipp32f), pFFTspec, NULL ))
        throw 6;
}

// IppStatus ippiMaxIndx_<mod>(const Ipp<datatype>* pSrc, int srcStep,
// IppiSize roiSize, Ipp<datatype>* pMax, int* pIndexX, int* pIndexY);
void MotionDetector::SearchMax( int* pX, int* pY )
{
    for (int i = 1; i < 5; i++)
    {
        pX[i] = 128; pY[i] = 128;
    }

    int i;
    Ipp32f max = 0.0;
    *pX = 0; *pY = 0;
    int imSize = fftSize.width * fftSize.height;

    // find the largest real component in the correlation image
    for (i = 0; i < imSize; i++)
    {
        //m_pYBufT1[i] = pCF[i].re;
        if (m_pComplexNumberImage[i].re > max)
        {
            max = m_pComplexNumberImage[i].re;
            pX[0] = i % fftSize.width;
            pY[0] = i / fftSize.width;
        }
    }

    // shift four quadrants such that each original corner
    // meet at the center
    pX[0] = (pX[0] + halfW) % fftSize.width;
    pX[0] = fftSize.width - pX[0]; // reverse direction of hor. component
    pY[0] = (pY[0] + halfH) % fftSize.height;

#ifdef TRACEVECTORS
    TraceVectors::AddVector(m_x, m_y, pX[0], pY[0]);
#endif
}

// store real component of complex number image in m_pYBufT2.
void MotionDetector::ComplexToReal( Ipp32fc* pCF )
{
    int i;
    int imSize = fftSize.width * fftSize.height;
    for (i = 0; i < imSize; i++)
    {
        m_pYBufT2[i] = pCF[i].re;
    }
}

// Rearrange four quadrants of image
// Shift each quadrant towards the center,
// such that four original corners are joined in the center.
// This makes the center correspond to zero flow vector.
void MotionDetector::Swap()
{
    int x, y;
    int halfH = fftSize.height/2;
    int halfW = fftSize.width/2;
    int i;
    Ipp32f tmpf;

    for (y = 0; y < halfH; y++)
    {
        Ipp32f* pTop = m_pYBufT2 + y*fftSize.width;
        Ipp32f* pBot = m_pYBufT2 + (y + halfH)*fftSize.width;
        for (x = 0; x < fftSize.width; x++)
        {
            // exchange each top-half pixel with the bottom-half pixel
            // half a row away horizontally (diagonal quadrant swap)
            tmpf = pTop[x];
            i = (x + halfW)%fftSize.width;
            pTop[x] = pBot[i];
            pBot[i] = tmpf;
        }
    }
}

void MotionDetector::SwitchBuffers(Ipp32f** buf1, Ipp32f** buf2)
{
    Ipp32f* tmp = *buf1;
    *buf1 = *buf2;
    *buf2 = tmp;
}

void MotionDetector::CalculatePhaseCorrelation()
{
    // calculate phase values for current frame:
    ippiPhasePack_32f_C1R(m_pYBufT2, fftSize.width*sizeof(Ipp32f), m_pYBufT1, fftSize.width*sizeof(Ipp32f), fftSize);

    // subtract previous phase values (m_pPhaseBuf) from current phase values, store results in m_pPhaseBuf:
    ippiSub_32f_C1IR(m_pYBufT1, fftSize.width*sizeof(Ipp32f), m_pPhaseBuf, fftSize.width*sizeof(Ipp32f), fftSize);

    // convert phase values and constant magnitudes to complex number values:
    ippiPolarToCart_32f32fc_P2C1R(m_pMagnitudes, m_pPhaseBuf, fftSize.width*sizeof(Ipp32f), m_pPhaseDiffMap, fftSize.width * sizeof(Ipp32fc), fftSize);

    // calculate inverse FFT of complex number phase values:
    ippiFFTInv_CToC_32fc_C1R(m_pPhaseDiffMap, fftSize.width * sizeof(Ipp32fc), m_pComplexNumberImage, fftSize.width * sizeof(Ipp32fc), pCFFTSpec, 0);
}

void MotionDetector::CalculateFlowVector()
{
    ApplyWindow(m_x, m_y); // output is m_pYBufT1

    if (m_bEnabled)
    {
        ApplyFFT(); // input is m_pYBufT1, output is m_pYBufT2
        CalculatePhaseCorrelation(); // input is m_pYBufT2, output is m_pYBufT1

        if (m_bDrawImage) // not the vector, just the phase correlation map
        {
            ComplexToReal(m_pComplexNumberImage); // copy real component to m_pYBufT2
            Swap();
            Ipp8u* pDest = m_pYBuf + m_x + m_y * m_HStride;
            if (ippStsNoErr != ippiConvert_32f8u_C1R(m_pYBufT2, fftSize.width * sizeof(Ipp32f),
                                                     pDest, m_HStride, fftSize, ippRndZero))
                throw 11;
        }

        // current phase values become previous phase values:
        //CopyMemory(m_pPhaseBuf, m_pYBufT1, fftSize.width*fftSize.height*sizeof(Ipp32f));
        SwitchBuffers(&m_pPhaseBuf, &m_pYBufT1);
    }
}

ctapang
Beginner

One thing I failed to mention is that I am doing my own threading (not IPP threading). The "MotionDetector" class above is instantiated in six separate threads. This "home-made" threading has performed very well in 5.1 on the Pentium 4 CPU. I would have to do a complete redesign to make use of IPP threading. Working under the theory that IPP threading is interfering with the home-made threading, I will try to explicitly turn OFF IPP threading and will report on the results.

ctapang
Beginner

I tried setting the environment variable OMP_NUM_THREADS = 1, and it's still very slow on the dual-proc Centrino. CPU utilization goes up to 100%, and the FFT computations can't catch up with the video frames. On the old Pentium 4, CPU utilization goes up to only about 50%, and there's interframe idle time.

Vladimir or Amanda, if you would like to look at this problem seriously, I can send you directly a zip file with all of the app code in it so you can see for yourself. I really need to get this working on my Centrino in order to be able to demo this app using my laptop. Otherwise, I will have to use my desktop (which is heavy and bulky) just to be able to show this app to people. It really does not look good for Intel that a new processor with dual cores inside (albeit designed for mobile computing) is slower than a three-year-old Pentium 4 CPU. I will be writing an article about this app in a technical magazine, and I really don't want to mention that the Centrino sucks.

Vladimir_Dudnik
Employee

Hi Carlos,

I've created an issue report on Intel Premier Support for you regarding this case. You should be contacted soon.

Regards,
Vladimir
