<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Will-Intel-Intrinsics-really-help-here/m-p/1005221#M3649</link>
    <description>&lt;P&gt;Can you run a VTune analysis of both test cases and post the screenshots?&lt;/P&gt;</description>
    <pubDate>Thu, 01 May 2014 17:51:45 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2014-05-01T17:51:45Z</dc:date>
    <item>
      <title>Will Intel Intrinsics really help here?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Will-Intel-Intrinsics-really-help-here/m-p/1005220#M3648</link>
      <description>&lt;P&gt;I have a short 64-bit comparison in the loop below: basically an AND accumulation across tables, followed by a find-first-set-bit. Typically the outer loop runs for ~400 iterations and the inner loop for 12 (_num_tables). If I replace the 64-bit operations with the intrinsics for 128-bit operations (halving the outer loop iterations to ~200), performance drops by about 35% compared to the 64-bit case. This is all on the latest hardware, compiled with -O3 etc.&lt;/P&gt;

&lt;P&gt;No single line appears to be the offender performance-wise in the intrinsic version. Is there anything stupid I'm doing in the 128-bit version that jumps out as an obvious performance no-no?&lt;/P&gt;

&lt;P&gt;Thanks for any advice!&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;        /* for each chunk of rules, i.e. 64 at a time */
        unsigned int end = conf-&amp;gt;_num_chunks * 2;
        for (j = 0; j &amp;lt; end; j += 2) {
                long int rule_match = 0xFFFFFFFFFFFFFFFF;
                /* For each table */
                for (i = 0; i &amp;lt; conf-&amp;gt;_num_tables; ++i) {
                        rule_match &amp;amp;= *((long int*)(conf-&amp;gt;_match_table[i][ packet[i] ] + j));
                        if (!rule_match)
                                goto next; /* don't need to proceed, no match */
                }
                return ffsl(rule_match);
        next: ;
        }
&lt;/PRE&gt;

&lt;P&gt;The 128-bit intrinsic version is below:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;        /* for each chunk of rules, i.e. 128 at a time */
        for (j = 0; j &amp;lt; conf-&amp;gt;_num_chunks2; ++j) {
                /* initial 128 bit wide value */
                __m128i rule_match_128 = max;
                unsigned short jump = j * 4;
                /* For each table */
                for (i = 0; i &amp;lt; conf-&amp;gt;_num_tables; ++i) {
                        uint8_t seg = packet[i];
                        /* point at the 128 bit entry for this table and segment */
                        __m128i *match_128 = (__m128i*)(conf-&amp;gt;_match_table[i][seg] + jump);
                        /* perform &amp;amp;= on 128 bit wide comparison */
                        rule_match_128 = _mm_and_si128(rule_match_128, *match_128);
                        if (_mm_movemask_epi8(_mm_cmpeq_epi32(rule_match_128, zero)) == 65535)
                                goto next; /* all 128 bits are zero, no match */
                }

                /* Only returning first match for now */
                for (i = 0; i &amp;lt; 128; ++i) {
                        __m128i cmp = _mm_and_si128(rule_match_128, lut[i]);
                        if (_mm_movemask_epi8(_mm_cmpeq_epi32(cmp, zero)) != 65535) {
                                return i;
                        }
                }
        next: ;
        }
&lt;/PRE&gt;
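<![CDATA[
<P>One difference between the two versions that may be worth measuring: the 64-bit loop finds the first match with a single ffsl, while the 128-bit loop can run up to 128 AND/compare/movemask rounds against lut. Below is a minimal sketch of a cheaper first-set-bit for a 128-bit value. It assumes x86-64 with SSE2, a GCC or Clang compiler (for __builtin_ffsll), and that bit n of the 128-bit value lives in 64-bit lane n/64. The helper names first_set_bit_128 and is_all_zero_128 are hypothetical and not part of the original code. Note that it returns a 1-based index like ffsl, whereas the lut loop above returns i directly.</P>

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 1-based index of the lowest set bit in v (1..128),
 * or 0 when v is all zero. Same convention as ffsl(). */
static int first_set_bit_128(__m128i v)
{
        /* low 64 bits of the vector */
        uint64_t lo = (uint64_t)_mm_cvtsi128_si64(v);
        if (lo)
                return __builtin_ffsll((long long)lo);

        /* move the high 64 bits into the low lane, then extract them */
        uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));
        if (hi)
                return 64 + __builtin_ffsll((long long)hi);

        return 0;
}

/* Hypothetical helper: nonzero when every bit of v is zero. */
static int is_all_zero_128(__m128i v)
{
        return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128())) == 0xFFFF;
}
```

<P>Under those assumptions, the final lut loop could be replaced by something like return first_set_bit_128(rule_match_128), adjusted for whichever index base the caller expects. Where SSE4.1 is available, _mm_testz_si128(v, v) is an alternative zero test. Whether either change recovers the 35% would need to be confirmed by profiling.</P>
]]>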

</description>
      <pubDate>Wed, 30 Apr 2014 20:57:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Will-Intel-Intrinsics-really-help-here/m-p/1005220#M3648</guid>
      <dc:creator>Michael_L_</dc:creator>
      <dc:date>2014-04-30T20:57:25Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Will-Intel-Intrinsics-really-help-here/m-p/1005221#M3649</link>
      <description>&lt;P&gt;Can you run a VTune analysis of both test cases and post the screenshots?&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2014 17:51:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Will-Intel-Intrinsics-really-help-here/m-p/1005221#M3649</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-05-01T17:51:45Z</dc:date>
    </item>
  </channel>
</rss>

