Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

## What kind of speed increase should I be seeing?

Hi everyone,

I've got a routine here that I want to make faster. The input data is roughly 4000x4000 elements (similar to my previous problem), and I expected some kind of speedup from using the Intel Performance Primitives (IPP), but I'm just not seeing it.

Here's the code:
#include <math.h> // log(), exp()

//#define INTELPRIMITIVES

#ifdef INTELPRIMITIVES
#include "ipp.h"
#endif
static void Linearize(unsigned short* inData, float* outData,
const int inXSize, const int inYSize, float inLinearizeConst = -9.9546e-5){

//float a = -9.9546e-5 * (log(2.0)/log(1.778));
float a = inLinearizeConst * (log(2.0)/log(1.778));

int i;
const int theSize = inXSize*inYSize;
#ifndef INTELPRIMITIVES
for (i = 0; i < theSize; i++){
outData[i] = (65535.0f - 65535.0f * exp(a * (float)inData[i]));
}
#else
ippStaticInit();//just to make sure
//first, copy the given data vector into a float vector
//then, multiply each element in the vector with a
//then, raise the vector by exp
//then, multiply by 65535
//then, subtract the vector from 65535
//then, place into the output vector
//and this should be faster?

Ipp32f* pStart = ippsMalloc_32f(inXSize*inYSize);
Ipp32f* pTmp = ippsMalloc_32f(inXSize*inYSize);

ippsConvert_16u32f(inData, pStart, theSize);

ippsMulC_32f(pStart, a, pTmp, theSize);
//now, do the raising, back into start
ippsExp_32f(pTmp, pStart, theSize);
//now, the multiplication, going back the other way
ippsMulC_32f(pStart, 65535.0f, pTmp, theSize);
//now, the element-by-element subtraction
//first, set the pStart array to be 65535
//then place into a third array, which will be copied out and returned.
ippsSet_32f(65535.0f, pStart, theSize);

//Ipp32f* pFinal = ippsMalloc_32f(theSize);
ippsSub_32f(pTmp, pStart, outData, theSize);

ippsFree(pStart);
ippsFree(pTmp);

#endif
}
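
To make the comparison concrete without any IPP calls, here is a minimal plain-C++ sketch of the two shapes I'm comparing: a fused single-pass loop, and a staged version that mirrors the five steps in my comments with one full pass per step. The names `LinearizeFused` and `LinearizeStaged` are hypothetical and exist only for this illustration. At 4000x4000 each float buffer is about 64 MB, so the staged form streams a lot of memory; my guess (not something I've measured) is that this could eat up whatever the vectorized math gains.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fused form: one pass over the data, same arithmetic as the plain loop.
// (Hypothetical name, for illustration only.)
void LinearizeFused(const std::vector<unsigned short>& in,
                    std::vector<float>& out, float a) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = 65535.0f - 65535.0f * std::exp(a * static_cast<float>(in[i]));
    }
}

// Staged form: one full pass per step, ping-ponging between two
// temporary buffers the way the IPP branch does.
// (Hypothetical name, for illustration only.)
void LinearizeStaged(const std::vector<unsigned short>& in,
                     std::vector<float>& out, float a) {
    assert(in.size() == out.size());
    const std::size_t n = in.size();
    std::vector<float> start(n), tmp(n);
    for (std::size_t i = 0; i < n; ++i) start[i] = static_cast<float>(in[i]); // convert to float
    for (std::size_t i = 0; i < n; ++i) tmp[i] = start[i] * a;                // multiply by a
    for (std::size_t i = 0; i < n; ++i) start[i] = std::exp(tmp[i]);          // exponentiate
    for (std::size_t i = 0; i < n; ++i) tmp[i] = start[i] * 65535.0f;         // scale by 65535
    for (std::size_t i = 0; i < n; ++i) out[i] = 65535.0f - tmp[i];           // subtract from 65535
}
```

Both functions should agree to within float rounding, so any timing difference between them comes from how the work is arranged rather than from what is computed.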

If I add or remove that #define, I see no appreciable difference in speed. Is the loop already being vectorized by the Intel compiler, so that explicitly splitting the math into separate IPP calls gains nothing? Is ::GetTickCount() accurate enough for this kind of timing? On images of that size, this routine takes around half a second; should it be faster on a 2.4 GHz Core 2 Duo?
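
On the timing question: as far as I know, GetTickCount() is documented as having a resolution typically in the 10 to 16 ms range, which is coarse but probably tolerable for a half-second measurement. For a finer-grained check, here is a minimal sketch that times the plain loop with `std::chrono::steady_clock` and keeps the fastest of several runs. It assumes a C++11 compiler; `BestOfMs` and `TimeScalarLoopMs` are hypothetical helper names, not part of IPP or the Windows API.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs fn() `repeats` times and returns the
// fastest run in milliseconds, which reduces noise from other processes.
template <typename Fn>
double BestOfMs(Fn&& fn, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    return best;
}

// Hypothetical helper: times the scalar linearize loop. The result is
// written to a caller-owned buffer so the work cannot be optimized away.
double TimeScalarLoopMs(const std::vector<unsigned short>& in,
                        std::vector<float>& out, float a, int repeats) {
    assert(in.size() == out.size());
    return BestOfMs([&] {
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = 65535.0f -
                     65535.0f * std::exp(a * static_cast<float>(in[i]));
        }
    }, repeats);
}
```

Calling `TimeScalarLoopMs(in, out, a, 5)` with 4000*4000-element buffers would give a number to compare against the IPP branch timed the same way. On Windows, QueryPerformanceCounter would be the native alternative to `std::chrono`.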

Thanks!