topic What kind of speed increase should I be seeing? in Intel® Integrated Performance Primitives
https://community.intel.com/t5/Intel-Integrated-Performance/What-kind-of-speed-increase-should-I-be-seeing/m-p/906853#M13598
Hi everyone,

I've got a routine here that I want to make faster. The size of the input data is roughly 4000x4000 (similar to my previous problem) and I want to see some kind of speedup using Intel primitives, but I'm just not seeing it.

Here's the code:

    //#define INTELPRIMITIVES

    #ifdef INTELPRIMITIVES
    #include "ipp.h"
    #endif

    static void Linearize(unsigned short* inData, float* outData,
                          const int inXSize, const int inYSize,
                          float inLinearizeConst = -9.9546e-5){

        //float a = -9.9546e-5 * (log(2.0)/log(1.778));
        float a = inLinearizeConst * (log(2.0)/log(1.778));

        int i;
        const int theSize = inXSize*inYSize;
    #ifndef INTELPRIMITIVES
        for (i = 0; i < theSize; i++){
            outData[i] = (65535.0f - 65535.0f * exp(a * (float)inData[i]));
        }
    #else
        ippStaticInit(); //just to make sure
        //first, copy the given data vector into a float vector
        //then, multiply each element in the vector by a
        //then, exponentiate the vector
        //then, multiply by 65535
        //then, subtract the vector from 65535
        //then, place into the output vector
        //and this should be faster?

        Ipp32f* pStart = ippsMalloc_32f(theSize);
        Ipp32f* pTmp = ippsMalloc_32f(theSize);

        ippsConvert_16u32f(inData, pStart, theSize);

        ippsMulC_32f(pStart, a, pTmp, theSize);
        //now, do the exponentiation, back into start
        ippsExp_32f(pTmp, pStart, theSize);
        //now, the multiplication, going back the other way
        ippsMulC_32f(pStart, 65535.0f, pTmp, theSize);
        //now, the element-by-element subtraction:
        //first, set the pStart array to 65535,
        //then place the result into the output vector.
        ippsSet_32f(65535.0f, pStart, theSize);

        //Ipp32f* pFinal = ippsMalloc_32f(theSize);
        ippsSub_32f(pTmp, pStart, outData, theSize);

        ippsFree(pStart);
        ippsFree(pTmp);
    #endif
    };

If I add or remove that #define, no appreciable difference in speed is seen. Is it that the loop is already vectorized by the Intel compiler, so spelling the math out as IPP calls isn't gaining anything? Is ::GetTickCount() accurate enough for this kind of timing? On images of that size, I'm seeing around half a second for this routine; should that be faster on a 2.4 GHz Core 2 Duo?

Thanks!

mmroden — Thu, 07 Aug 2008 18:06:06 GMT
Re: What kind of speed increase should I be seeing?
https://community.intel.com/t5/Intel-Integrated-Performance/What-kind-of-speed-increase-should-I-be-seeing/m-p/906854#M13599
<P>Hello,</P>
<P>In your case there are huge (4Kx4K!) intermediate buffers, which cause your algorithm to thrash the processor cache at each processing stage. You'd better use slicing to process the image in relatively small parts which fit into the cache; you may get up to a 3X speedup just because of that.</P>
<P>If you have the IPP 6.0 beta, I would recommend you take a look at the Intel Deferred Mode Image Processing Layer, which was specifically developed to simplify the coding of calculation pipelines with slicing, and to utilize the threading capability of modern processors to parallelize processing at the slice level.</P>
<P>Regards,<BR /> Vladimir</P>

Vladimir_Dudnik — Thu, 07 Aug 2008 18:44:33 GMT
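The slicing advice above can be sketched in plain C++. The IPP calls are replaced here by scalar loops so the example is self-contained, and the slice length is illustrative (anything small enough that the temporary stays in L1/L2 cache works):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Process the image in cache-sized slices instead of allocating full-image
// temporaries: the one intermediate buffer is reused for every slice, so it
// stays hot in cache across all stages of the pipeline.
static void LinearizeSliced(const unsigned short* inData, float* outData,
                            int theSize, float inLinearizeConst = -9.9546e-5f) {
    const float a = inLinearizeConst * (std::log(2.0f) / std::log(1.778f));
    const int kSlice = 16384;            // 64 KB of floats; illustrative size
    std::vector<float> tmp(kSlice);
    for (int base = 0; base < theSize; base += kSlice) {
        const int n = std::min(kSlice, theSize - base);
        // stage 1: convert and scale (ippsConvert_16u32f + ippsMulC_32f on the slice)
        for (int i = 0; i < n; ++i) tmp[i] = a * (float)inData[base + i];
        // stage 2: exponentiate (ippsExp_32f)
        for (int i = 0; i < n; ++i) tmp[i] = std::exp(tmp[i]);
        // stage 3: final scale and subtract, 65535 - 65535*e
        for (int i = 0; i < n; ++i)
            outData[base + i] = 65535.0f - 65535.0f * tmp[i];
    }
}
```

With IPP the same structure applies: call each ipps function on `n` elements starting at `base` rather than on the whole image, with `pStart`/`pTmp` allocated at slice size instead of 4000x4000.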