<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Performance boost is not as expected using SSE intrinsics in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911254#M2973</link>
    <description></description>
    <pubDate>Wed, 20 Jan 2010 11:34:57 GMT</pubDate>
    <dc:creator>joggingsonggmail_com</dc:creator>
    <dc:date>2010-01-20T11:34:57Z</dc:date>
    <item>
      <title>Performance boost is not as expected using SSE intrinsics</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911249#M2968</link>
      <description>Hi all,&lt;BR /&gt;To boost performance, I chose to program using SSE intrinsics. After measuring the execution time, I find that the improvement is not significant, only 20%.&lt;BR /&gt;&lt;BR /&gt;The original program calculates the sum sequentially, and the SSE intrinsics program calculates the sum with four additions in parallel. Several instructions are needed to prepare the data in the proper format, but the wide load instruction reduces the number of memory loads. I expected the execution time to be reduced to 1/3 or less, so the measured result is very disappointing.&lt;BR /&gt;&lt;BR /&gt;How can I estimate the performance boost when SSE intrinsics are used? Does programming with SSE intrinsics require programmers to know the details of the processor architecture well?&lt;BR /&gt;&lt;BR /&gt;The following is my code:&lt;BR /&gt; Start = GetCycleCount();&lt;BR /&gt; SumC = 0;&lt;BR /&gt; pSrc = FixedImg;&lt;BR /&gt; for(i = 0; i &amp;lt; ImageHeight*ImageWidth; i++)&lt;BR /&gt; {&lt;BR /&gt; SumC += *pSrc++;&lt;BR /&gt; }&lt;BR /&gt; Stop = GetCycleCount();&lt;BR /&gt; printf("Sum C cycle: %d\n", (Stop - Start));&lt;BR /&gt; printf("SumC: %d\n", SumC);&lt;BR /&gt;&lt;BR /&gt; Start = GetCycleCount();&lt;BR /&gt; SumSSE = 0;&lt;BR /&gt; {&lt;BR /&gt; __m128i Sum, Dat1, Dat2, Dat3;&lt;BR /&gt; __m128i vzero;&lt;BR /&gt;&lt;BR /&gt; pSrc = FixedImg;&lt;BR /&gt; vzero = _mm_setzero_si128();&lt;BR /&gt; Sum = _mm_setzero_si128();&lt;BR /&gt; for(i = 0; i &amp;lt; ImageHeight*ImageWidth/16; i++)&lt;BR /&gt; {&lt;BR /&gt; Dat1 = _mm_loadu_si128( (__m128i*)(pSrc));&lt;BR /&gt; Dat2 = _mm_unpacklo_epi8(Dat1, vzero);&lt;BR /&gt; Dat3 = _mm_unpacklo_epi16(Dat2, vzero);&lt;BR /&gt; Sum = _mm_add_epi32(Sum, Dat3);&lt;BR /&gt; Dat3 = _mm_unpackhi_epi16(Dat2, vzero);&lt;BR /&gt; Sum = _mm_add_epi32(Sum, Dat3);&lt;BR /&gt; Dat2 = _mm_unpackhi_epi8(Dat1, vzero);&lt;BR /&gt; Dat3 = _mm_unpacklo_epi16(Dat2, vzero);&lt;BR /&gt; Sum = _mm_add_epi32(Sum, Dat3);&lt;BR /&gt; Dat3 = _mm_unpackhi_epi16(Dat2, vzero);&lt;BR /&gt; Sum = _mm_add_epi32(Sum, Dat3);&lt;BR /&gt; pSrc += 16;&lt;BR /&gt; }&lt;BR /&gt; Dat1 = _mm_unpacklo_epi32(Sum, vzero);&lt;BR /&gt; Dat2 = _mm_unpackhi_epi32(Sum, vzero);&lt;BR /&gt; Sum = _mm_add_epi64(Dat1, Dat2);&lt;BR /&gt; Dat1 = _mm_unpacklo_epi64(Sum, vzero);&lt;BR /&gt; Dat2 = _mm_unpackhi_epi64(Sum, vzero);&lt;BR /&gt; Sum = _mm_add_epi64(Dat1, Dat2);&lt;BR /&gt;&lt;BR /&gt; SumSSE = _mm_cvtsi128_si32(Sum);&lt;BR /&gt; }&lt;BR /&gt; Stop = GetCycleCount();&lt;BR /&gt; printf("Sum SSE cycle: %d\n", (Stop - Start));&lt;BR /&gt; printf("SumSSE: %d\n", SumSSE);&lt;BR /&gt;&lt;BR /&gt;Best Regards&lt;BR /&gt;Jogging</description>
      <pubDate>Tue, 19 Jan 2010 07:07:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911249#M2968</guid>
      <dc:creator>joggingsonggmail_com</dc:creator>
      <dc:date>2010-01-19T07:07:05Z</dc:date>
    </item>
    <item>
      <title>Performance boost is not as expected using SSE intrinsics</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911250#M2969</link>
      <description>
&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;You should try to get rid of the &lt;I&gt;_mm_loadu_si128&lt;/I&gt; intrinsic. On some architectures, it can have a significant impact on performance. You should compute the sum of the first N bytes with scalar code so that the SSE code is applied only to aligned data using &lt;I&gt;_mm_load_si128&lt;/I&gt;. Then you can compute the sum of the remainder and add the three partial sums.&lt;/P&gt;
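&lt;P&gt;A minimal sketch of that head / aligned body / tail split, assuming SSE2 and a buffer small enough that the 32-bit lanes cannot overflow. The function name sum_bytes_aligned is hypothetical and not part of the original program; the body keeps the same unpack scheme as the posted loop.&lt;/P&gt;

```c
#include "emmintrin.h" /* SSE2 intrinsics */
#include "stddef.h"
#include "stdint.h"

/* Hypothetical helper (not from the original post): sums n bytes, using
   aligned 128-bit loads for the middle of the buffer. Assumes the 32-bit
   lanes do not overflow, which holds for buffers of a few megabytes. */
unsigned long sum_bytes_aligned(const unsigned char *p, size_t n)
{
    unsigned long head = 0, tail = 0;
    size_t i = 0;

    /* 1. Scalar head: stop at the first 16-byte-aligned address. */
    while (i != n) {
        if ((uintptr_t)(p + i) % 16 == 0)
            break;
        head += p[i++];
    }

    /* 2. Aligned body: whole 16-byte blocks only. */
    const __m128i vzero = _mm_setzero_si128();
    __m128i sum = _mm_setzero_si128();
    for (; n - i >= 16; i += 16) {
        __m128i d  = _mm_load_si128((const __m128i *)(p + i));
        __m128i lo = _mm_unpacklo_epi8(d, vzero);
        __m128i hi = _mm_unpackhi_epi8(d, vzero);
        sum = _mm_add_epi32(sum, _mm_unpacklo_epi16(lo, vzero));
        sum = _mm_add_epi32(sum, _mm_unpackhi_epi16(lo, vzero));
        sum = _mm_add_epi32(sum, _mm_unpacklo_epi16(hi, vzero));
        sum = _mm_add_epi32(sum, _mm_unpackhi_epi16(hi, vzero));
    }

    /* 3. Scalar tail: the bytes left after the last full block. */
    for (; i != n; ++i)
        tail += p[i];

    /* Add the three partial sums (the vector one is spread over 4 lanes). */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, sum);
    return head + tail + lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

&lt;P&gt;Called as sum_bytes_aligned(FixedImg, ImageHeight*ImageWidth), this should return the same value as the scalar loop for any start address, since the head loop absorbs the misaligned bytes.&lt;/P&gt;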
&lt;P&gt;You can probably reduce the number of unpack instructions by doing 16-bit adds before unpacking to 32 bits.&lt;/P&gt;
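&lt;P&gt;One way this could look, as a sketch rather than a tuned implementation: accumulate in 16-bit lanes for at most 128 blocks (each lane then holds at most 128 * 2 * 255 = 65280, which still fits in 16 bits), and widen to 32 bits only once per chunk. The name sum_bytes_16bit is hypothetical, unaligned loads are used to keep the example short, and bytes beyond the last full 16-byte block are ignored.&lt;/P&gt;

```c
#include "emmintrin.h" /* SSE2 intrinsics */
#include "stddef.h"
#include "stdint.h"

/* Hypothetical helper: sums the first (n / 16) * 16 bytes of p.
   Assumes the 32-bit lanes do not overflow (buffers of a few megabytes). */
unsigned long sum_bytes_16bit(const unsigned char *p, size_t n)
{
    const __m128i vzero = _mm_setzero_si128();
    __m128i acc32 = _mm_setzero_si128();
    size_t blocks = n / 16;
    size_t b = 0;

    while (b != blocks) {
        /* Each block adds 2 bytes to every 16-bit lane, so 128 blocks
           add at most 65280 per lane and cannot overflow. */
        size_t chunk = blocks - b;
        if (chunk > 128)
            chunk = 128;

        __m128i acc16 = _mm_setzero_si128();
        for (size_t k = 0; k != chunk; ++k, ++b) {
            __m128i d = _mm_loadu_si128((const __m128i *)(p + 16 * b));
            acc16 = _mm_add_epi16(acc16, _mm_unpacklo_epi8(d, vzero));
            acc16 = _mm_add_epi16(acc16, _mm_unpackhi_epi8(d, vzero));
        }

        /* Widen the eight 16-bit partial sums to 32 bits once per chunk. */
        acc32 = _mm_add_epi32(acc32, _mm_unpacklo_epi16(acc16, vzero));
        acc32 = _mm_add_epi32(acc32, _mm_unpackhi_epi16(acc16, vzero));
    }

    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc32);
    return (unsigned long)lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

&lt;P&gt;Per 16 input bytes the inner loop now needs two unpacks and two adds instead of six unpacks and four adds; the two 32-bit unpacks are paid only once per 128 blocks.&lt;/P&gt;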
&lt;P&gt;Matthieu&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2010 13:32:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911250#M2969</guid>
      <dc:creator>matthieu_darbois</dc:creator>
      <dc:date>2010-01-19T13:32:32Z</dc:date>
    </item>
    <item>
      <title>Performance boost is not as expected using SSE intrinsics</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911251#M2970</link>
      <description>&lt;P&gt;&lt;EM&gt;How can I estimate the performance boost when&lt;BR /&gt;SSE intrinsics is used?&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;it depends on your compiler: if your code is already vectorized by the compiler, a 20% gain from using intrinsics is quite good. =&amp;gt; post the ASM dump of the code generated by your compiler and tell us more about your target CPU (for example, on Nehalem alignment is no big deal, but it can hurt older CPUs like the Pentium 4)&lt;/P&gt;
&lt;P&gt;for example with Intel C++ 11.1, this code :&lt;/P&gt;
&lt;SPAN style="font-family: Lucida Console; font-size: x-small;"&gt;&lt;SPAN style="font-family: Lucida Console; font-size: x-small;"&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;unsigned long Jogging (const unsigned char *FixedImg, unsigned long ImageHeight, unsigned long ImageWidth)&lt;/P&gt;
&lt;P&gt;{&lt;/P&gt;
&lt;P&gt;unsigned long SumC = 0;&lt;/P&gt;
&lt;P&gt;const unsigned char *pSrc = FixedImg;&lt;/P&gt;
&lt;P&gt;for (unsigned long i=0; i&amp;lt;ImageHeight*ImageWidth; i++) SumC += *pSrc++;&lt;/P&gt;
&lt;P&gt;return SumC;&lt;/P&gt;
&lt;P&gt;}&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;is vectorized, ASM of the core loop is below:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;.B8.4: ; Preds .B8.4 .B8.3&lt;/P&gt;
&lt;P&gt;$LN323:&lt;/P&gt;
&lt;P&gt;movd xmm2, DWORD PTR [eax+esi] ;254.58&lt;/P&gt;
&lt;P&gt;punpcklbw xmm2, xmm0 ;254.58&lt;/P&gt;
&lt;P&gt;punpcklwd xmm2, xmm0 ;254.58&lt;/P&gt;
&lt;P&gt;paddd xmm1, xmm2 ;254.58&lt;/P&gt;
&lt;P&gt;$LN325:&lt;/P&gt;
&lt;P&gt;add eax, 4 ;254.3&lt;/P&gt;
&lt;P&gt;cmp eax, edx ;254.3&lt;/P&gt;
&lt;P&gt;jb .B8.4 ; Prob 82% ;254.3&lt;/P&gt;
&lt;/SPAN&gt;&lt;/SPAN&gt;</description>
      <pubDate>Tue, 19 Jan 2010 17:58:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911251#M2970</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2010-01-19T17:58:12Z</dc:date>
    </item>
    <item>
      <title>Performance boost is not as expected using SSE intrinsics</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911252#M2971</link>
      <description>&lt;P&gt;try using&lt;/P&gt;
&lt;P&gt;psadbw with zero to sum 8 horizontal bytes into words x8 xn&lt;/P&gt;
&lt;P&gt;reduce with paddw until you have 8 signed words&lt;/P&gt;
&lt;P&gt;shift 4 of those left by 16, OR them with the other 4, and apply pmaddwd with 1,1 to get 4 dwords,&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;which you can sum with paddd&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;final stage is 2 phaddd&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2010 08:03:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911252#M2971</guid>
      <dc:creator>neni</dc:creator>
      <dc:date>2010-01-20T08:03:52Z</dc:date>
    </item>
    <item>
      <title>Performance boost is not as expected using SSE intrinsics</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911253#M2972</link>
      <description>
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thanks, Matthieu.&lt;BR /&gt;Using 16-bit additions is a good idea; a friend suggested it too.&lt;BR /&gt;When I first wrote the intrinsics program, I tried to avoid memory loads. On x86 processors, a memory load has latency even on a cache hit. But x86 processors have an out-of-order execution core that can hide this latency, so in most cases the goal is simply to use as few instructions as possible.&lt;BR /&gt;&lt;BR /&gt;I have a question about aligned memory loads. In the current instruction set, it appears that only the 128-bit load has separate versions for unaligned and aligned memory. The 64-bit load instruction movq doesn't care about alignment.&lt;BR /&gt;&lt;BR /&gt;Best Regards&lt;BR /&gt;Jogging&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2010 11:13:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911253#M2972</guid>
      <dc:creator>joggingsonggmail_com</dc:creator>
      <dc:date>2010-01-20T11:13:48Z</dc:date>
    </item>
    <item>
      <title>Performance boost is not as expected using SSE intrinsics</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911254#M2973</link>
      <description>
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thanks.&lt;BR /&gt;The processor in my laptop is a Core 2 Duo T7500. My current C compiler is the one from VS 2008.&lt;BR /&gt;The generated assembly code is below.&lt;BR /&gt;&lt;BR /&gt;for( i=0; i&amp;lt;width...) {&lt;BR /&gt; iresult0 += *pSrc++;&lt;BR /&gt;00401066 movzx eax,byte ptr [ecx]&lt;BR /&gt;00401069 add dword ptr [iresult0],eax&lt;BR /&gt;0040106C inc ecx&lt;BR /&gt;0040106D sub dword ptr [ebp-10h],1&lt;BR /&gt;00401071 jne main+66h (401066h)&lt;BR /&gt; }&lt;BR /&gt;&lt;BR /&gt;Although only one byte is processed per iteration, it executes pretty fast.&lt;BR /&gt;&lt;BR /&gt;Regards&lt;/P&gt;
&lt;P&gt;Jogging&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2010 11:34:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911254#M2973</guid>
      <dc:creator>joggingsonggmail_com</dc:creator>
      <dc:date>2010-01-20T11:34:57Z</dc:date>
    </item>
    <item>
      <title>Performance boost is not as expected using SSE intrinsics</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911255#M2974</link>
      <description>&lt;P&gt;the VS 2008 code looks so horrible that it's very strange that your version is only 20% faster; maybe you are memory-bandwidth bound (i.e. your image doesn't fit in the L2 cache)?&lt;/P&gt;
&lt;P&gt;now, as hinted by another poster, the best is probably to use PSADBW (_mm_sad_epu8) to achieve 16 additions in a single instruction, then accumulate the packed results with PADDD (_mm_add_epi32)&lt;/P&gt;
&lt;P&gt;at the end you'll get something like: 0 | partialsum2 | 0 | partialsum1, then just add element 2 to element 0&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;the best for your experiments is probably to work first with a special test image: allocated with 16B alignment so that you can use aligned moves, and not too big so that it fits in the L2 cache; then, when your speedups are OK, generalize to the unaligned case&lt;/P&gt;
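&lt;P&gt;a rough sketch of the PSADBW idea, under a few assumptions: the name sum_bytes_sad is made up for illustration, unaligned loads are used so that it runs on any buffer (swap in _mm_load_si128 for the 16B-aligned test image), bytes after the last full 16-byte block are ignored, and the image is small enough that the 32-bit partial sums cannot overflow&lt;/P&gt;

```c
#include "emmintrin.h" /* SSE2 intrinsics */
#include "stddef.h"
#include "stdint.h"

/* Hypothetical helper: sums the first (n / 16) * 16 bytes of p. */
unsigned long sum_bytes_sad(const unsigned char *p, size_t n)
{
    const __m128i vzero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();

    for (size_t i = 0; n - i >= 16; i += 16) {
        __m128i d = _mm_loadu_si128((const __m128i *)(p + i));
        /* PSADBW against zero: each 64-bit half receives the sum of its
           8 bytes, so 16 bytes are reduced by a single instruction. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(d, vzero));
    }

    /* acc now holds 0 | partialsum2 | 0 | partialsum1.
       Shift element 2 down to element 0 and add the two partial sums. */
    __m128i high = _mm_srli_si128(acc, 8);
    return (unsigned long)(uint32_t)_mm_cvtsi128_si32(_mm_add_epi32(acc, high));
}
```

&lt;P&gt;the loop body is one load, one PSADBW and one PADDD per 16 bytes, with no unpack instructions at all&lt;/P&gt;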
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2010 18:09:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-boost-is-not-as-expected-using-SSE-intrinsics/m-p/911255#M2974</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2010-01-20T18:09:58Z</dc:date>
    </item>
  </channel>
</rss>

