
I've got a routine here that I want to make faster. The input data is roughly 4000x4000 (similar to my previous problem), and I expected to see some kind of speedup from using Intel primitives, but I'm just not seeing it.

Here's the code:

//#define INTELPRIMITIVES

#ifdef INTELPRIMITIVES

#include "ipp.h"

#endif

static void Linearize(unsigned short* inData, float* outData,

const int inXSize, const int inYSize, float inLinearizeConst = -9.9546e-5){

//float a = -9.9546e-5 * (log(2.0)/log(1.778));

float a = inLinearizeConst * (log(2.0)/log(1.778));

int i;

const int theSize = inXSize*inYSize;

#ifndef INTELPRIMITIVES

for (i = 0; i < theSize; i++){

outData[i] = (65535.0f - 65535.0f * exp(a * (float)inData[i]));

}

#else

ippStaticInit();//just to make sure

//first, copy the given data vector into a float vector

//then, multiply each element in the vector with a

//then, raise the vector by exp

//then, multiply by 65535

//then, subtract the vector from 65535

//then, place into the output vector

//and this should be faster?

Ipp32f* pStart = ippsMalloc_32f(inXSize*inYSize);

Ipp32f* pTmp = ippsMalloc_32f(inXSize*inYSize);

ippsConvert_16u32f(inData, pStart, theSize);

ippsMulC_32f(pStart, a, pTmp, theSize);

//now, do the raising, back into start

ippsExp_32f(pTmp, pStart, theSize);

//now, the multiplication, going back the other way

ippsMulC_32f(pStart, 65535.0f, pTmp, theSize);

//now, the element-by-element subtraction

//first, set the pStart array to be 65535

//then place into a third array, which will be copied out and returned.

ippsSet_32f(65535.0f, pStart, theSize);

//Ipp32f* pFinal = ippsMalloc_32f(theSize);

ippsSub_32f(pTmp, pStart, outData, theSize);

ippsFree(pStart);

ippsFree(pTmp);

#endif

};

If I add or remove that #define, I see no appreciable difference in speed. Is it that the loop is already vectorized by the Intel compiler, so the explicit unrolling of the math isn't doing anything? Is ::GetTickCount() accurate enough for this kind of timing? On images of that size, I'm seeing around half a second for this routine; should that be faster on a 2.4 GHz Core 2 Duo?

Thanks!
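On the timing question: ::GetTickCount() typically has a resolution of about 10 to 16 ms, so for a routine that takes roughly half a second it is coarse but usable. A minimal sketch of a finer-grained alternative, assuming a C++11 compiler, is to time several runs with std::chrono::steady_clock and keep the fastest one. The helper name BestOfMilliseconds and the repeat count are hypothetical, not from the original post.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper (not from the original post): runs `work` `repeats`
// times and returns the fastest run in milliseconds. Keeping the minimum
// reduces noise from the scheduler and from cold caches on the first run.
template <typename Work>
double BestOfMilliseconds(Work&& work, int repeats = 10)
{
    using Clock = std::chrono::steady_clock;  // monotonic clock
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const Clock::time_point t0 = Clock::now();
        work();
        const Clock::time_point t1 = Clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
    }
    return best;
}
```

It could be called as BestOfMilliseconds([&] { Linearize(inData, outData, xSize, ySize); }), once with and once without the #define, where inData, outData, xSize and ySize stand for the caller's own buffers and dimensions.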


Hello,

In your case there are huge (4Kx4K!) intermediate buffers, which cause your algorithm to thrash the processor cache at each processing stage. You'd do better to use slicing, processing the image in relatively small parts that fit into cache; you may get up to a 3X speedup just because of that.

If you have IPP 6.0 beta, I would recommend taking a look at the Intel Deferred Mode Image Processing Layer, which was specifically developed to simplify coding of calculation pipelines with slicing and to utilize the threading capability of modern processors to parallelize processing at the slice level.

Regards,

Vladimir
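The slicing suggested in the reply above can be sketched roughly as follows. This is a minimal illustration in plain C++, not the Deferred Mode layer and not a tuned implementation: the function name LinearizeSliced and the block size kBlock are assumptions made for the example, and each per-stage loop stands in for one of the IPP calls in the original post, applied to one block instead of the whole image.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Assumed block size: 4096 floats = 16 KB of scratch, small enough to stay
// in cache. The best value depends on the CPU and should be measured.
constexpr std::size_t kBlock = 4096;

// Hypothetical sliced variant of Linearize(): the same stages as the
// original pipeline, but run block by block so the intermediate buffer is
// a small reusable scratch array rather than a full-image allocation.
void LinearizeSliced(const unsigned short* inData, float* outData,
                     std::size_t theSize,
                     float inLinearizeConst = -9.9546e-5f)
{
    const float a = static_cast<float>(
        inLinearizeConst * (std::log(2.0) / std::log(1.778)));
    float tmp[kBlock];  // scratch reused for every block

    for (std::size_t start = 0; start < theSize; start += kBlock) {
        const std::size_t n = std::min(kBlock, theSize - start);
        const unsigned short* src = inData + start;
        float* dst = outData + start;

        // Stage 1: convert to float and scale by a
        // (ippsConvert_16u32f + ippsMulC_32f in the original post).
        for (std::size_t i = 0; i < n; ++i)
            tmp[i] = a * static_cast<float>(src[i]);

        // Stage 2: exponentiate (ippsExp_32f in the original post).
        for (std::size_t i = 0; i < n; ++i)
            tmp[i] = std::exp(tmp[i]);

        // Stage 3: 65535 - 65535 * e, written straight to the output
        // (ippsMulC_32f + ippsSet_32f + ippsSub_32f in the original post).
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = 65535.0f - 65535.0f * tmp[i];
    }
}
```

The point of the design is that the scratch data for a block is still in cache when the next stage reads it, whereas the original version streams two full-image float buffers through memory at every stage. With IPP, the same structure applies by calling each primitive with src, tmp, dst and n instead of the whole-image pointers and theSize.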
