I am using FFT routines, and I notice that these run much slower on my laptop (Centrino T7500) than on my hyperthreading desktop. Is this expected? If so, why? Thanks in advance.
--Carlos
Hi Carlos,
What processor is installed on your hyperthreading desktop?
Also, please note that if you link your application with the IPP static libraries, you need to call the ippStaticInit() function somewhere at the beginning of your application to initialize the IPP static dispatcher.
You can check which IPP processor-specific library is loaded by default on your system: run the ippiDemo executable and choose Help->About from the application menu. It should show that the T7 family of libraries is loaded on the T7500.
You might also be interested in upgrading to the latest version of IPP, currently IPP 5.3, which includes V8 processor-specific libraries targeted specifically at the Intel Core2 architecture.
Regards,
Vladimir
Thank you, Vladimir.
My desktop processor is a 3-year-old Pentium 4, running at 3 GHz with an 800 MHz front-side bus. It supports hyperthreading, MMX, and SSE/SSE2/SSE3.
I tried IPP Version 5.3, and it's still slow on the Centrino, about 4 times slower than 5.1 running on the old Pentium 4. When I run the demo ps_ippac.exe, it reports that my Centrino is a Core 2 Duo processor 2x2200 MHz, L1=32/32K.
I ran the Processor Identification Utility on the Centrino, and it reports that it has SSE/SSE2/SSE3, but no MMX. So maybe not having MMX is a huge factor? If so, then I just bought myself a castrated CPU.
Actually, the availability of MMX does not affect performance on the newest processors, because in IPP we mostly use the SSEx instruction sets.
Remember that if you link your application with the IPP static libraries, you need to call the ippStaticInit() function at the beginning of your program; otherwise the PX code (which does not use MMX or SSE instructions) will be dispatched in IPP.
Regards,
Vladimir
Hi Vladimir,
I am not linking with the static libraries. I am letting IPP decide which dynamic library to use. GetCPUType returns ippCPUC2D on version 5.3.1.064. Is there a way to find out exactly which dynamic libraries are being loaded? I appreciate your help.
Thanks.
--Carlos
If ippGetCpuType returns the correct value, then most probably everything is OK with the dispatching of IPP processor-specific code. You can verify that by calling ippGetLibVersion() (or a domain-specific variant such as ippjGetLibVersion(), used here) and printing the values in the IppLibraryVersion structure. Code like this does the trick:
const IppLibraryVersion* v = ippjGetLibVersion();
printf("Intel Integrated Performance Primitives ");
printf(" version: %s, [%d.%d.%d.%d] ",
       v->Version, v->major, v->minor, v->majorBuild, v->build);
printf(" name: %s ", v->Name);
printf(" date: %s ", v->BuildDate);
The name of the library contains a two-letter code that identifies the processor-specific code. For example, for the Intel Core2 architecture it will be V8 on a 32-bit OS (and U8 on a 64-bit OS).
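The check described above can be automated with a small helper. The following is only a sketch: it assumes the library name has the form seen later in this thread (for example ippjv8-5.3.dll), where the two characters just before the first hyphen are the processor code. The function name cpuCodeFromLibName is hypothetical and is not part of IPP.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not an IPP function.
// Assumption: the processor code is the two characters immediately before
// the first '-' in the library name (e.g. "v8" in "ippjv8-5.3.dll").
// Returns an empty string if the name does not fit that pattern.
std::string cpuCodeFromLibName(const std::string& name)
{
    const std::string::size_type dash = name.find('-');
    if (dash == std::string::npos || dash < 2) {
        return std::string();
    }
    return name.substr(dash - 2, 2);
}
```

Passing the Name field of the IppLibraryVersion structure to this helper would let an application log or compare the dispatched code at startup, provided the naming assumption holds for the IPP version in use.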
Could you please also share your test case? There might be some issue with the range of the input data or something like that.
Vladimir
Here's what I get from ippjGetLibVersion:
IPP: version: 5.3 Update 1 build 85.18, [5.3.85.477], name: ippjv8-5.3.dll, date: Nov 17 2007
And so the correct DLL is being loaded.
Let me try to attach a couple of files to this message.
Here is part of the code I am running. Please note again that this same code runs fast on the Pentium 4, but very slow on the Centrino.
#include "ipp.h"
#include "NearestNeighbors.h"
#if MODE!=2
#include "LocalMaxima.h"
#endif
#include "StdAfx.h"
#include "MotionDetector.h"
#ifdef TRACEVECTORS
// TraceVectors is not part of the VS project so that it does not get linked in if _DEBUG is OFF.
#include "TraceVectors.cpp"
#endif
/// MotionDetector class
MotionDetector::MotionDetector(void)
{
    InitializeCriticalSection( &m_cs );
    // FFT dimensions are powers of two given by the FFT orders
    fftSize.width = 1 << orderX;
    fftSize.height = 1 << orderY;
    halfW = fftSize.width / 2;
    halfH = fftSize.height / 2;
    m_pYBuf = NULL;
    m_bEnabled = true;
    m_bDrawImage = false;
    ippiFFTInitAlloc_R_32f( &pFFTspec, orderX, orderY, IPP_FFT_DIV_BY_SQRTN, ippAlgHintAccurate );
    ippiFFTInitAlloc_C_32fc( &pCFFTSpec, orderX, orderY, IPP_FFT_DIV_BY_SQRTN, ippAlgHintAccurate );
    m_pMagnitudes = AllocateFloatBuf(&fftSize);
    m_pYBufT1 = AllocateFloatBuf(&fftSize);
    m_pYBufT2 = AllocateFloatBuf(&fftSize);
    m_pPhaseBuf = AllocateFloatBuf(&fftSize);
    int sz = fftSize.width * fftSize.height * sizeof(Ipp32fc);
    m_pPhaseDiffMap = (Ipp32fc*)ippMalloc( sz );
    m_pComplexNumberImage = (Ipp32fc*)ippMalloc( sz );
    // fill the magnitude buffer with a constant value
    int x, y;
    for (y = 0; y < fftSize.height; y++){
        Ipp32f* pRow = m_pMagnitudes + y * fftSize.width;
        for (x = 0; x < fftSize.width; x++){
            pRow[x] = 66.0;
        }
    }
}
MotionDetector::~MotionDetector(void)
{
    EnterCriticalSection(&m_cs);
    if (m_pYBufT1 != NULL){
        ippFree(m_pYBufT1);
        m_pYBufT1 = NULL;
    }
    if (m_pYBufT2 != NULL){
        ippFree(m_pYBufT2);
        m_pYBufT2 = NULL;
    }
    if (m_pPhaseBuf != NULL){
        ippFree(m_pPhaseBuf);
        m_pPhaseBuf = NULL;
    }
    if (m_pMagnitudes != NULL){
        ippFree(m_pMagnitudes);
        m_pMagnitudes = NULL;
    }
    if (m_pPhaseDiffMap != NULL){
        ippFree(m_pPhaseDiffMap);
        m_pPhaseDiffMap = NULL;
    }
    if (m_pComplexNumberImage != NULL){
        ippFree(m_pComplexNumberImage);
        m_pComplexNumberImage = NULL;
    }
    ippiFFTFree_C_32fc( pCFFTSpec );
    ippiFFTFree_R_32f( pFFTspec );
    LeaveCriticalSection(&m_cs);
    DeleteCriticalSection(&m_cs);
}
void MotionDetector::InitMotionDetector(int x, int y, Ipp8u* pYBuf, VIDEOINFOHEADER* pVIH)
{
    m_x = x;
    m_y = y;
    m_pYBuf = pYBuf;
    m_HStride = pVIH->bmiHeader.biWidth;
    m_VStride = pVIH->bmiHeader.biHeight;
}
void MotionDetector::ApplyWindow(int x, int y)
{
    int fftStride = fftSize.width * sizeof(Ipp32f);
    Ipp8u* pSrc = m_pYBuf + x + y * m_HStride;
    if (ippStsNoErr != ippiConvert_8u32f_C1R(pSrc, m_HStride, m_pYBufT1, fftStride, fftSize))
        throw 5;
    if (ippStsNoErr != ippiWinBartlett_32f_C1IR(m_pYBufT1, fftStride, fftSize))
        throw 4;
}
Ipp32f* MotionDetector::AllocateFloatBuf(IppiSize* pRoi)
{
    int sz = pRoi->width * pRoi->height;
    Ipp32f* pFTBuf = (Ipp32f*)ippMalloc(sz * sizeof(Ipp32f));
    //ippsZero_32s(m_pFTBuf,sz);
    if (pFTBuf == NULL)
        throw 7;
    return pFTBuf;
}
void MotionDetector::ApplyFFT()
{
    if (ippStsNoErr != ippiFFTFwd_RToPack_32f_C1R( m_pYBufT1, fftSize.width*sizeof(Ipp32f), m_pYBufT2, fftSize.width*sizeof(Ipp32f), pFFTspec, NULL ))
        throw 6;
}
// IppStatus ippiMaxIndx_<mod>(const Ipp<datatype>* pSrc, int srcStep,
//     IppiSize roiSize, Ipp<datatype>* pMax, int* pIndexX, int* pIndexY);
void MotionDetector::SearchMax( int* pX, int* pY )
{
    for (int i = 1; i < 5; i++){
        pX[i] = 128; pY[i] = 128;
    }
    int i;
    Ipp32f max = 0.0;
    *pX = 0; *pY = 0;
    int imSize = fftSize.width * fftSize.height;
    // find the position of the largest real component
    for (i = 0; i < imSize; i++){
        //m_pYBufT1[i] = pCF[i].re;
        if (m_pComplexNumberImage[i].re > max){
            max = m_pComplexNumberImage[i].re;
            pX[0] = i % fftSize.width;
            pY[0] = i / fftSize.width;
        }
    }
    // shift four quadrants such that each original corner
    // meet at the center
    pX[0] = (pX[0] + halfW) % fftSize.width;
    pX[0] = fftSize.width - pX[0]; // reverse direction of hor. component
    pY[0] = (pY[0] + halfH) % fftSize.height;
#ifdef TRACEVECTORS
    TraceVectors::AddVector(m_x, m_y, pX[0], pY[0]);
#endif
}
// store real component of complex number image in m_pYBufT2.
void MotionDetector::ComplexToReal( Ipp32fc* pCF )
{
    int i;
    int imSize = fftSize.width * fftSize.height;
    for (i = 0; i < imSize; i++){
        m_pYBufT2[i] = pCF[i].re;
    }
}
// Rearrange four quadrants of image
// Shift each quadrant towards the center,
// such that four original corners are joined in the center.
// This makes the center correspond to zero flow vector.
void MotionDetector::Swap()
{
    int x, y;
    int halfH = fftSize.height/2;
    int halfW = fftSize.width/2;
    int i;
    Ipp32f tmpf;
    for (y = 0; y < halfH; y++){
        Ipp32f* pTop = m_pYBufT2 + y*fftSize.width;
        Ipp32f* pBot = m_pYBufT2 + (y + halfH)*fftSize.width;
        for (x = 0; x < fftSize.width; x++){
            tmpf = pTop[x];
            i = (x + halfW)%fftSize.width;
            pTop[x] = pBot[i];
            pBot[i] = tmpf;
        }
    }
}
void MotionDetector::SwitchBuffers(Ipp32f** buf1, Ipp32f** buf2)
{
    Ipp32f* tmp = *buf1;
    *buf1 = *buf2;
    *buf2 = tmp;
}
void MotionDetector::CalculatePhaseCorrelation()
{
    // calculate phase values for current frame:
    ippiPhasePack_32f_C1R(m_pYBufT2, fftSize.width*sizeof(Ipp32f), m_pYBufT1, fftSize.width*sizeof(Ipp32f), fftSize);
    // subtract previous phase values (m_pPhaseBuf) from current phase values, store results in m_pPhaseBuf:
    ippiSub_32f_C1IR(m_pYBufT1, fftSize.width*sizeof(Ipp32f), m_pPhaseBuf, fftSize.width*sizeof(Ipp32f), fftSize);
    // convert phase values and constant magnitudes to complex number values:
    ippiPolarToCart_32f32fc_P2C1R(m_pMagnitudes, m_pPhaseBuf, fftSize.width*sizeof(Ipp32f), m_pPhaseDiffMap, fftSize.width * sizeof(Ipp32fc), fftSize);
    // calculate inverse FFT of complex number phase values:
    ippiFFTInv_CToC_32fc_C1R(m_pPhaseDiffMap, fftSize.width * sizeof(Ipp32fc), m_pComplexNumberImage, fftSize.width * sizeof(Ipp32fc), pCFFTSpec, 0);
}
void MotionDetector::CalculateFlowVector()
{
    ApplyWindow(m_x, m_y); // output is m_pYBufT1
    if (m_bEnabled){
        ApplyFFT(); // input is m_pYBufT1, output is m_pYBufT2
        CalculatePhaseCorrelation(); // input is m_pYBufT2, output is m_pYBufT1
        if (m_bDrawImage) // not the vector, just the phase correlation map
        {
            ComplexToReal(m_pComplexNumberImage); // copy real component to m_pYBufT2
            Swap();
            Ipp8u* pDest = m_pYBuf + m_x + m_y * m_HStride;
            if (ippStsNoErr != ippiConvert_32f8u_C1R(m_pYBufT2, fftSize.width * sizeof(Ipp32f),pDest, m_HStride, fftSize, ippRndZero))
                throw 11;
        }
        // current phase values become previous phase values:
        //CopyMemory(m_pPhaseBuf, m_pYBufT1, fftSize.width*fftSize.height*sizeof(Ipp32f));
        SwitchBuffers(&m_pPhaseBuf, &m_pYBufT1);
    }
}
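For readers following the quadrant rearrangement in the listing, here is a small standalone sketch of the same idea that does not depend on IPP. It is an illustration only: the name shiftQuadrants and the use of std::vector are assumptions made for this example, and it assumes even image dimensions (which holds for power-of-two FFT sizes).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical standalone version of the quadrant swap, for illustration.
// Swaps diagonally opposite quadrants of a row-major width x height image
// in place, so the element at (0, 0) moves to (width/2, height/2).
void shiftQuadrants(std::vector<float>& img, int width, int height)
{
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    const int halfW = width / 2;
    const int halfH = height / 2;
    for (int y = 0; y < halfH; ++y) {
        float* top = &img[static_cast<std::size_t>(y) * width];
        float* bot = &img[static_cast<std::size_t>(y + halfH) * width];
        for (int x = 0; x < width; ++x) {
            // Each top-half element is exchanged with the bottom-half element
            // that is half a width away horizontally.
            std::swap(top[x], bot[(x + halfW) % width]);
        }
    }
}
```

Applying the function twice restores the original image, which makes it easy to check on a small buffer before wiring the same logic into a larger pipeline.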
I tried setting the environment variable OMP_NUM_THREADS = 1, and it's still very slow on the dual-core Centrino. CPU utilization goes up to 100%, and the FFT computations can't keep up with the video frames. On the old Pentium 4, CPU utilization goes up to only about 50%, and there's interframe idle time.
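One way to narrow down a symptom like this is to time each processing stage separately on both machines and compare. The following is a generic sketch using only the C++ standard library (C++11 or later); the name averageMillis is hypothetical, and nothing here is specific to IPP.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `stage` `iterations` times and returns the
// average wall-clock time per call, in milliseconds.
template <typename Stage>
double averageMillis(Stage&& stage, int iterations)
{
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        stage();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

Wrapping each step of the pipeline (windowing, forward FFT, phase correlation) in a lambda and passing it to this helper would show which step accounts for the difference between the two processors, assuming those steps can be called in isolation.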
Vladimir or Amanda, if you would like to look at this problem seriously, I can send you directly a zip file with all of the app code in it so you can see for yourself. I really need to get this working on my Centrino in order to be able to demo this app using my laptop. Otherwise, I will have to use my desktop (which is heavy and bulky) just to be able to show this app to people. It really does not look good for Intel that a new processor with dual cores inside (albeit designed for mobile computing) is slower than a three-year-old Pentium 4 CPU. I will be writing an article about this app in a technical magazine, and I really don't want to mention that the Centrino sucks.
Hi Carlos,
I've created an issue report on Intel Premier Support for you regarding this case. You should be contacted soon.
Regards,
Vladimir