Hi everyone,
I've got a routine here that I want to make faster. The input data is roughly 4000x4000 (similar to my previous problem), and I was hoping to see some kind of speedup using the Intel Performance Primitives, but I'm just not seeing it.
Here's the code:
//#define INTELPRIMITIVES
#include <math.h>
#ifdef INTELPRIMITIVES
#include "ipp.h"
#endif

static void Linearize(unsigned short* inData, float* outData,
                      const int inXSize, const int inYSize, float inLinearizeConst = -9.9546e-5)
{
    //float a = -9.9546e-5 * (log(2.0)/log(1.778));
    float a = inLinearizeConst * (log(2.0) / log(1.778));
    const int theSize = inXSize * inYSize;
#ifndef INTELPRIMITIVES
    for (int i = 0; i < theSize; i++){
        outData[i] = 65535.0f - 65535.0f * expf(a * (float)inData[i]);
    }
#else
    ippStaticInit(); //just to make sure
    //first, convert the given data vector into a float vector
    //then, multiply each element of the vector by a
    //then, exponentiate the vector
    //then, multiply by 65535
    //then, subtract the vector from 65535
    //then, place the result into the output vector
    //and this should be faster?
    Ipp32f* pStart = ippsMalloc_32f(theSize);
    Ipp32f* pTmp = ippsMalloc_32f(theSize);
    ippsConvert_16u32f((const Ipp16u*)inData, pStart, theSize); // pStart = (float)inData
    ippsMulC_32f(pStart, a, pTmp, theSize);                     // pTmp = a * pStart
    //now, do the exponentiation, back into pStart
    ippsExp_32f(pTmp, pStart, theSize);                         // pStart = exp(pTmp)
    //now, the multiplication, going back the other way
    ippsMulC_32f(pStart, 65535.0f, pTmp, theSize);              // pTmp = 65535 * pStart
    //now, the element-by-element subtraction:
    //set pStart to 65535, then subtract pTmp from it, straight into the output vector
    ippsSet_32f(65535.0f, pStart, theSize);
    ippsSub_32f(pTmp, pStart, outData, theSize);                // outData = pStart - pTmp
    ippsFree(pStart);
    ippsFree(pTmp);
#endif
}
Whether or not I enable that #define, I see no appreciable difference in speed. Is the loop already being vectorized by the Intel compiler, so that spelling the math out as explicit IPP calls gains nothing? Is ::GetTickCount() accurate enough for this kind of timing? On images of that size I'm seeing around half a second for this routine; should that be faster on a 2.4 GHz Core 2 Duo?
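In case it matters, I'm timing it roughly like this (a sketch, not my exact harness; buffer setup is omitted, and I've shown QueryPerformanceCounter since it's finer-grained than GetTickCount):

#include <windows.h> // QueryPerformanceCounter / GetTickCount
#include <stdio.h>

// Rough timing harness (sketch): inData/outData are assumed to be
// already-allocated 4000x4000 buffers as in the real code.
LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);
Linearize(inData, outData, 4000, 4000);
QueryPerformanceCounter(&t1);
double ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
printf("Linearize: %.1f ms\n", ms);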
Thanks!
Hello,
In your case there are huge (4Kx4K!) intermediate buffers, which cause your algorithm to thrash the processor cache at each processing stage. You would be better off slicing: process the image in relatively small parts that fit into cache, and you may get up to a 3X speedup just from that.
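For example, here is a rough sketch of the slicing idea using the same calls as your code (the slice size is illustrative, not tuned; pick one so the scratch buffers stay in L2 cache):

// Process the image in slices so the intermediate buffers stay in cache.
// kBlock = 16K floats = 64 KB per scratch buffer (illustrative).
const int kBlock = 16 * 1024;
Ipp32f* pStart = ippsMalloc_32f(kBlock);
Ipp32f* pTmp = ippsMalloc_32f(kBlock);
for (int offset = 0; offset < theSize; offset += kBlock) {
    const int len = (theSize - offset < kBlock) ? (theSize - offset) : kBlock;
    ippsConvert_16u32f((const Ipp16u*)(inData + offset), pStart, len); // 16u -> 32f
    ippsMulC_32f(pStart, a, pTmp, len);                                // a * x
    ippsExp_32f(pTmp, pStart, len);                                    // exp(a * x)
    ippsMulC_32f(pStart, 65535.0f, pTmp, len);                         // 65535 * exp(a * x)
    ippsSet_32f(65535.0f, pStart, len);
    ippsSub_32f(pTmp, pStart, outData + offset, len);                  // 65535 - 65535 * exp(a * x)
}
ippsFree(pStart);
ippsFree(pTmp);

This way each slice is converted, multiplied, exponentiated and subtracted while it is still in cache, instead of streaming the whole 4Kx4K buffer through memory at every stage.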
If you have the IPP 6.0 beta, I would also recommend taking a look at the Intel Deferred Mode Image Processing (DMIP) layer, which was specifically developed to simplify the coding of calculation pipelines with slicing and to utilize the threading capability of modern processors by parallelizing processing at the slice level.
Regards,
Vladimir