<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Optimizing indirect memory read and write  access on MIC   in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Optimizing-indirect-memory-read-and-write-access-on-MIC/m-p/1006128#M32201</link>
    <description>&lt;P&gt;Extended addition (described below ) is one of the most performance critical kernel in our code (that implements important functions from sparse linear algebra). &amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;AddEx(double* A, int LDA, double *B, int LDB, int*C, ...)
{
    for (jj = 0; jj &amp;lt; col; ++jj)
    {
        /*if jj segment is not empty */
        if (seg[jj]) 
        {

            for (i = 0; i &amp;lt; row; ++i)
            {

                A[C&lt;I&gt;] -= B&lt;I&gt;;

            }
            B += LDB;
        }
        A += LDA;
    } 

}&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;Compared, to something like SpMV ( which reads from indirect addresses), this both reads and write to indirect memory addresses&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;In general, we perform &amp;nbsp;a number of &amp;nbsp;extended additions operations on independent blocks A_{i} and B_{i} concurrently using openMP parallel for.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Assuming A[C&lt;I&gt;] -= B&lt;I&gt; takes 2 reads and 1 write ( C&lt;I&gt; is assumed to be in cache ) , so in total 3*row*col memory ops, measuring time shows low bandwidth of around 52 GB/sec obtained. While there can be load imbalance among openMP threads, but still 52 GB/sec is quite low. I seek suggestions to improve it. My experience with SIMD instructions is limited, however, I tried as follows.&lt;/I&gt;&lt;/I&gt;&lt;/I&gt;&lt;/P&gt;

&lt;P&gt;I took the inner loop&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;for (i = 0; i &amp;lt; row; ++i)
{

    A[C&lt;I&gt;] -= B&lt;I&gt;;

}
&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;PRE class="brush:;"&gt;&lt;SPAN style="font-family: Arial, 宋体, Tahoma, Helvetica, sans-serif; font-size: 1em; line-height: 1.5;"&gt;and replaced it with following SIMDized loop&amp;nbsp;&lt;/SPAN&gt;
&lt;/PRE&gt;

&lt;PRE class="brush:cpp;"&gt;            __m512i v_rel;
            __m512d v_A;
            __m512d v_B;
            __mmask8 mask;

            int row_8 = row/8*8;
            for (i = 0; i &amp;lt; row_8; i += 8)
            {
                v_rel = _mm512_extloadunpacklo_epi32 (v_rel, &amp;amp;C&lt;I&gt;,
                                                      _MM_UPCONV_EPI32_NONE, _MM_HINT_NT);
                v_rel = _mm512_extloadunpackhi_epi32 (v_rel, &amp;amp;C[i + 16],
                                                      _MM_UPCONV_EPI32_NONE, _MM_HINT_NT);
                v_B = _mm512_extloadunpacklo_pd (v_B, &amp;amp;B&lt;I&gt;,
                                                     _MM_UPCONV_PD_NONE, _MM_HINT_NT);
                v_B = _mm512_extloadunpackhi_pd (v_B, &amp;amp;B[i + 8],
                                                     _MM_UPCONV_PD_NONE, _MM_HINT_NT);
                v_A = _mm512_i32logather_pd (v_rel, A, _MM_SCALE_8);
                v_A = _mm512_sub_pd (v_A, v_B);
                _mm512_i32loscatter_pd (A, v_rel, v_A, _MM_SCALE_8);
            }
            
            /*handling remainders*/
            .... 
&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;The performance improves using SIMD and now it is around 60 GB/sec. There seems to be plently of room for improvement.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Upper bound on sizes of row and col is ~128 (defined by user ) (while LDA and LDB can be large, only a continues portion of length upto 128 of B maps to upto 128 contgeous portion of A). The source vector B has fewer rows than A. In general C&lt;I&gt; is not &amp;nbsp;monotonous (i.e. C&lt;I&gt;&amp;gt;=C&lt;J&gt; if i&amp;gt;j does not hold) .&amp;nbsp;&lt;/J&gt;&lt;/I&gt;&lt;/I&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 02 May 2014 16:20:01 GMT</pubDate>
    <dc:creator>piyush_s_</dc:creator>
    <dc:date>2014-05-02T16:20:01Z</dc:date>
    <item>
      <title>Optimizing indirect memory read and write  access on MIC</title>
      <link>https://community.intel.com/t5/Software-Archive/Optimizing-indirect-memory-read-and-write-access-on-MIC/m-p/1006128#M32201</link>
      <description>&lt;P&gt;Extended addition (described below ) is one of the most performance critical kernel in our code (that implements important functions from sparse linear algebra). &amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;AddEx(double* A, int LDA, double *B, int LDB, int*C, ...)
{
    for (jj = 0; jj &amp;lt; col; ++jj)
    {
        /*if jj segment is not empty */
        if (seg[jj]) 
        {

            for (i = 0; i &amp;lt; row; ++i)
            {

                A[C&lt;I&gt;] -= B&lt;I&gt;;

            }
            B += LDB;
        }
        A += LDA;
    } 

}&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;Compared, to something like SpMV ( which reads from indirect addresses), this both reads and write to indirect memory addresses&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;In general, we perform &amp;nbsp;a number of &amp;nbsp;extended additions operations on independent blocks A_{i} and B_{i} concurrently using openMP parallel for.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Assuming A[C&lt;I&gt;] -= B&lt;I&gt; takes 2 reads and 1 write ( C&lt;I&gt; is assumed to be in cache ) , so in total 3*row*col memory ops, measuring time shows low bandwidth of around 52 GB/sec obtained. While there can be load imbalance among openMP threads, but still 52 GB/sec is quite low. I seek suggestions to improve it. My experience with SIMD instructions is limited, however, I tried as follows.&lt;/I&gt;&lt;/I&gt;&lt;/I&gt;&lt;/P&gt;

&lt;P&gt;I took the inner loop&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;for (i = 0; i &amp;lt; row; ++i)
{

    A[C&lt;I&gt;] -= B&lt;I&gt;;

}
&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;PRE class="brush:;"&gt;&lt;SPAN style="font-family: Arial, 宋体, Tahoma, Helvetica, sans-serif; font-size: 1em; line-height: 1.5;"&gt;and replaced it with following SIMDized loop&amp;nbsp;&lt;/SPAN&gt;
&lt;/PRE&gt;

&lt;PRE class="brush:cpp;"&gt;            __m512i v_rel;
            __m512d v_A;
            __m512d v_B;
            __mmask8 mask;

            int row_8 = row/8*8;
            for (i = 0; i &amp;lt; row_8; i += 8)
            {
                v_rel = _mm512_extloadunpacklo_epi32 (v_rel, &amp;amp;C&lt;I&gt;,
                                                      _MM_UPCONV_EPI32_NONE, _MM_HINT_NT);
                v_rel = _mm512_extloadunpackhi_epi32 (v_rel, &amp;amp;C[i + 16],
                                                      _MM_UPCONV_EPI32_NONE, _MM_HINT_NT);
                v_B = _mm512_extloadunpacklo_pd (v_B, &amp;amp;B&lt;I&gt;,
                                                     _MM_UPCONV_PD_NONE, _MM_HINT_NT);
                v_B = _mm512_extloadunpackhi_pd (v_B, &amp;amp;B[i + 8],
                                                     _MM_UPCONV_PD_NONE, _MM_HINT_NT);
                v_A = _mm512_i32logather_pd (v_rel, A, _MM_SCALE_8);
                v_A = _mm512_sub_pd (v_A, v_B);
                _mm512_i32loscatter_pd (A, v_rel, v_A, _MM_SCALE_8);
            }
            
            /*handling remainders*/
            .... 
&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;The performance improves using SIMD and now it is around 60 GB/sec. There seems to be plently of room for improvement.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Upper bound on sizes of row and col is ~128 (defined by user ) (while LDA and LDB can be large, only a continues portion of length upto 128 of B maps to upto 128 contgeous portion of A). The source vector B has fewer rows than A. In general C&lt;I&gt; is not &amp;nbsp;monotonous (i.e. C&lt;I&gt;&amp;gt;=C&lt;J&gt; if i&amp;gt;j does not hold) .&amp;nbsp;&lt;/J&gt;&lt;/I&gt;&lt;/I&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 02 May 2014 16:20:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Optimizing-indirect-memory-read-and-write-access-on-MIC/m-p/1006128#M32201</guid>
      <dc:creator>piyush_s_</dc:creator>
      <dc:date>2014-05-02T16:20:01Z</dc:date>
    </item>
    <item>
      <title>If the prefetch instructions</title>
      <link>https://community.intel.com/t5/Software-Archive/Optimizing-indirect-memory-read-and-write-access-on-MIC/m-p/1006129#M32202</link>
      <description>If the prefetch instructions which presumably were generated in your C code loop weren't helping performance, does it mean that you aren't incurring cache misses, or that they weren't generated effectively?  Your hint indicates that maybe the loop isn't big enough for prefetch within the loop to help (even if you follow the advice about indirect prefetch)
&lt;A href="https://www.google.com/url?q=https://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-5.pdf&amp;amp;sa=U&amp;amp;ei=Rc1jU_-iE8-lyATAnIKoAQ&amp;amp;ved=0CCAQFjAB&amp;amp;sig2=ekDzB0nMI-kVyOncQjCy-w&amp;amp;usg=AFQjCNFTgB-iU0HO-1AS1RpGcB0UEUeS4Q" target="_blank"&gt;https://www.google.com/url?q=https://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-5.pdf&amp;amp;sa=U&amp;amp;ei=Rc1jU_-iE8-lyATAnIKoAQ&amp;amp;ved=0CCAQFjAB&amp;amp;sig2=ekDzB0nMI-kVyOncQjCy-w&amp;amp;usg=AFQjCNFTgB-iU0HO-1AS1RpGcB0UEUeS4Q&lt;/A&gt;
, and some tinkering with initial prefetch may be needed.  I don't know whether loop count directives should influence the C compiler.  If it's worth the effort, you might try prefetching individually all the operands for a short loop before entering the loop.
I assume the gather and scatter intrinsics involve looping over the individual cache lines.  So the peak performance depends on how many separate cache lines are required for each iteration of the programmed loop. 
VTune KNC general analysis should help to show whether the performance limitation occurs at some level of cache and whether most of the time is spent in the gather and scatter.</description>
      <pubDate>Fri, 02 May 2014 16:50:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Optimizing-indirect-memory-read-and-write-access-on-MIC/m-p/1006129#M32202</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-05-02T16:50:00Z</dc:date>
    </item>
  </channel>
</rss>

