What kind of speed increase should I be seeing?

mmroden · 08-07-2008

Hi everyone,

I've got a routine here that I want to make faster. The input data is roughly 4000x4000 (similar to my previous problem), and I want to see some kind of speedup using the Intel Performance Primitives (IPP), but I'm just not seeing it.

Here's the code:
//#define INTELPRIMITIVES

#ifdef INTELPRIMITIVES
#include "ipp.h"
#endif
#include <math.h> // log() and exp()
static void Linearize(unsigned short* inData, float* outData,
const int inXSize, const int inYSize, float inLinearizeConst = -9.9546e-5){

//float a = -9.9546e-5 * (log(2.0)/log(1.778));
float a = inLinearizeConst * (log(2.0)/log(1.778));

int i;
const int theSize = inXSize*inYSize;
#ifndef INTELPRIMITIVES
for (i = 0; i < theSize; i++){
outData[i] = (65535.0f - 65535.0f * exp(a * (float)inData[i]));
}
#else
ippStaticInit();//just to make sure
//first, copy the given data vector into a float vector
//then, multiply each element in the vector with a
//then, raise the vector by exp
//then, multiply by 65535
//then, subtract the vector from 65535
//then, place into the output vector
//and this should be faster?

Ipp32f* pStart = ippsMalloc_32f(inXSize*inYSize);
Ipp32f* pTmp = ippsMalloc_32f(inXSize*inYSize);

ippsConvert_16u32f(inData, pStart, theSize);

ippsMulC_32f(pStart, a, pTmp, theSize);
//now, do the raising, back into start
ippsExp_32f(pTmp, pStart, theSize);
//now, the multiplication, going back the other way
ippsMulC_32f(pStart, 65535.0f, pTmp, theSize);
//now, the element-by-element subtraction
//first, set the pStart array to be 65535
//then place into a third array, which will be copied out and returned.
ippsSet_32f(65535.0f, pStart, theSize);

//Ipp32f* pFinal = ippsMalloc_32f(theSize);
ippsSub_32f(pTmp, pStart, outData, theSize);

ippsFree(pStart);
ippsFree(pTmp);

#endif
};

If I add or remove that #define, I see no appreciable difference in speed. Is it that the loop is already vectorized by the Intel compiler, so that explicitly splitting the math into IPP calls isn't gaining anything? Is using ::GetTickCount() accurate enough for this kind of timing? On images of that size, I'm seeing around half a second for this routine; should that be faster on a 2.4 GHz Core 2 Duo?

Thanks!

Vladimir_Dudnik · 08-07-2008

Hello,

In your case there are huge (4Kx4K!) intermediate buffers, which cause your algorithm to thrash the processor cache at each processing stage. You'd do better to use slicing: process the image in relatively small parts that fit into the cache. You may get up to a 3X speedup just because of that.
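
A minimal sketch of the slicing idea, written as plain C++ rather than IPP so that it stands alone. `LinearizeSliced` and `sliceLen` are hypothetical names, and 4096 floats (16 KB) is only an assumed starting point for the slice size, not a measured optimum. Each stage corresponds to one of the IPP calls in the original routine (named in the comments), and those calls accept the same pointer-plus-length slices. The point is that one small scratch buffer is reused for every slice, instead of two full-image temporaries of about 64 MB each.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): the same math as
// Linearize(), but run slice by slice so the scratch data stays in cache.
// sliceLen is an assumed tuning parameter and must be greater than zero.
static void LinearizeSliced(const unsigned short* inData, float* outData,
                            std::size_t theSize, float a,
                            std::size_t sliceLen = 4096) {
    std::vector<float> tmp(sliceLen);  // one small scratch buffer, reused

    for (std::size_t start = 0; start < theSize; start += sliceLen) {
        // The last slice may be shorter than sliceLen.
        const std::size_t n = std::min(sliceLen, theSize - start);
        const unsigned short* src = inData + start;
        float* dst = outData + start;

        // Stage 1: convert to float and scale by a
        // (ippsConvert_16u32f followed by ippsMulC_32f in the post).
        for (std::size_t i = 0; i < n; ++i)
            tmp[i] = a * static_cast<float>(src[i]);

        // Stage 2: exponentiate in place (ippsExp_32f in the post).
        for (std::size_t i = 0; i < n; ++i)
            tmp[i] = std::exp(tmp[i]);

        // Stage 3: 65535 - 65535 * x, written straight to the output
        // (ippsMulC_32f, ippsSet_32f and ippsSub_32f in the post).
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = 65535.0f - 65535.0f * tmp[i];
    }
}
```

The output should match the single-pass loop up to floating-point rounding. Whether the gain approaches the 3X figure depends on the cache size and the slice length, so it is worth timing a few values.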

If you have the IPP 6.0 beta, I would recommend taking a look at the Intel Deferred Mode Image Processing Layer, which was specifically developed to simplify the coding of calculation pipelines with slicing, and to use the threading capability of modern processors to parallelize processing at the slice level.

Regards,
Vladimir
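
On the timing question from the original post: ::GetTickCount() typically has a resolution of about 10 to 16 ms, so it can register a half-second run but cannot resolve differences of a few milliseconds between variants. A rough sketch of a finer-grained measurement follows, assuming a C++11 compiler (which postdates this thread). `BestTimeMs` is a hypothetical helper name. Taking the best of several runs is one common way to reduce noise from one-time costs such as the first ippStaticInit() call or large allocations.

```cpp
#include <chrono>

// Hypothetical helper: calls fn() `repeats` times and returns the shortest
// wall-clock time of a single call, in milliseconds.
template <typename Fn>
double BestTimeMs(Fn&& fn, int repeats = 5) {
    using Clock = std::chrono::steady_clock;  // monotonic clock
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = Clock::now();
        fn();
        const auto t1 = Clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best) best = ms;  // keep the fastest run
    }
    return best;
}
```

For example, wrapping the call as `BestTimeMs([&] { Linearize(inData, outData, xSize, ySize); })` with and without the #define would give two numbers that can be compared directly (the buffer and size names here are placeholders).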