<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic What I suspect to be in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Unexplained-speedup-how/m-p/1028697#M6652</link>
    <description>&lt;P&gt;What I suspect is happening is that, in the 4x test case, the array LMASK is sparsely (or at least not densely) populated with .true. values.&lt;/P&gt;

&lt;P&gt;Under this circumstance, the code may have been performing masked load/store operations where the entire mask is .false. (so no element is actually loaded or stored).&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Wed, 29 Jul 2015 14:47:50 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2015-07-29T14:47:50Z</dc:date>
    <item>
      <title>Unexplained speedup how?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Unexplained-speedup-how/m-p/1028695#M6650</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;

&lt;P&gt;I got a speedup of approximately 4x just by rewriting a few WHERE statements in my loop as explicit DO loops, as illustrated below.&lt;/P&gt;

&lt;P&gt;Unchanged code&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;where ( LMASK )

  WORK1(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  &amp;amp;
                 * SLX(:,:,kk,kbt,k,bid) * dz(k)
  WORK2(:,:,kk) = c2 * dzwr(k) * ( WORK1(:,:,kk)            &amp;amp;
    - KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) &amp;amp;
                                  * dz(k+1) )

  WORK2_NEXT = c2 * ( &amp;amp;
    KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) - &amp;amp;
    KAPPA_THIC(:,:,kbt,k+1,bid) * SLX(:,:,kk,kbt,k+1,bid) )

  WORK3(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  &amp;amp;
                 * SLY(:,:,kk,kbt,k,bid) * dz(k)
  WORK4(:,:,kk) = c2 * dzwr(k) * ( WORK3(:,:,kk)            &amp;amp;
    - KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) &amp;amp;
                                  * dz(k+1) )

  WORK4_NEXT = c2 * ( &amp;amp;
    KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) - &amp;amp;
    KAPPA_THIC(:,:,kbt,k+1,bid) * SLY(:,:,kk,kbt,k+1,bid) )

endwhere&lt;/PRE&gt;

&lt;P&gt;Changed code&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;do j=1,ny_block
  do i=1,nx_block

    if ( LMASK(i,j) ) then

      WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &amp;amp;
                     * SLX(i,j,kk,kbt,k,bid) * dz(k)

      WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk)            &amp;amp;
        - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) &amp;amp;
                                      * dz(k+1) )

      WORK2_NEXT(i,j) = c2 * ( &amp;amp;
        KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - &amp;amp;
        KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) )

      WORK3(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &amp;amp;
                     * SLY(i,j,kk,kbt,k,bid) * dz(k)

      WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk)            &amp;amp;
        - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) &amp;amp;
                                      * dz(k+1) )

      WORK4_NEXT(i,j) = c2 * ( &amp;amp;
        KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - &amp;amp;
        KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) )

    endif

  enddo
enddo&lt;/PRE&gt;

&lt;P&gt;We were unable to make much use of VTune, as the code ran for an eternity (about 1 day). It also gave no information explaining such a high speedup, just a few red bars with the time taken. Those times agreed with the times shown by the OMP timers added between the loops.&lt;/P&gt;

&lt;P&gt;The unchanged code was compiled with -O3, while the changed code was compiled with -O2. I can rule out loop fusion as the reason (that is, fusion of the WORK1 and WORK2 loops, which are implicit in the array-section (:) syntax): according to the opt-report, the loops were fused for the original code.&lt;/P&gt;

&lt;P&gt;I can guarantee that no OpenMP was used. All flags are the same in both cases (except -O3 versus -O2). Is there any explanation for why the speedup is as high as 4x?&lt;/P&gt;
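&lt;P&gt;To check whether the difference reproduces in isolation, a minimal standalone sketch along the following lines could be timed with the same flags. Everything in it is an assumption made for illustration: the program name, the array names a, b, w1, w2 and lmask, the sizes, the repeat count, and the roughly 10 percent mask density are hypothetical and are not taken from the real code.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program where_vs_do
  implicit none
  ! All names and sizes here are hypothetical, chosen only for illustration.
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: nx = 512, ny = 512, nrep = 200
  real(dp), allocatable :: a(:,:), b(:,:), w1(:,:), w2(:,:)
  logical, allocatable :: lmask(:,:)
  integer :: i, j, rep
  integer(i8) :: t0, t1, rate

  allocate(a(nx,ny), b(nx,ny), w1(nx,ny), w2(nx,ny), lmask(nx,ny))
  call random_number(a)
  call random_number(b)
  lmask = a .gt. 0.9_dp      ! roughly 10 percent of the mask is .true.
  w1 = 0.0_dp
  w2 = 0.0_dp

  ! Version 1: masked array assignment.
  call system_clock(t0, rate)
  do rep = 1, nrep
    where (lmask)
      w1 = a * b * real(rep, dp)
    end where
  end do
  call system_clock(t1)
  print *, 'WHERE seconds:', real(t1 - t0, dp) / real(rate, dp)

  ! Version 2: explicit loops with an IF on each element.
  call system_clock(t0)
  do rep = 1, nrep
    do j = 1, ny
      do i = 1, nx
        if (lmask(i,j)) w2(i,j) = a(i,j) * b(i,j) * real(rep, dp)
      end do
    end do
  end do
  call system_clock(t1)
  print *, 'DO/IF seconds:', real(t1 - t0, dp) / real(rate, dp)

  ! Use both results so the compiler cannot discard the work entirely.
  print *, 'max difference:', maxval(abs(w1 - w2))
end program where_vs_do&lt;/PRE&gt;

&lt;P&gt;An optimizer may simplify a toy loop like this far more than the real kernel, so the timings should be treated as a rough indication only.&lt;/P&gt;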

&lt;P style="margin-bottom: 0in; line-height: 100%"&gt;UPDATE&lt;/P&gt;

&lt;P style="margin-bottom: 0in; line-height: 100%"&gt;I checked the optimization report (optrpt) and found the following for the unchanged and changed code.&lt;/P&gt;

&lt;P style="margin-bottom: 0in; line-height: 100%"&gt;Unchanged code&lt;/P&gt;

&lt;P&gt;remark #15448: unmasked aligned unit stride loads: 13&lt;BR /&gt;
remark #15449: unmasked aligned unit stride stores: 3&lt;BR /&gt;
remark #15450: unmasked unaligned unit stride loads: 3&lt;BR /&gt;
remark #15455: masked aligned unit stride stores: 6&lt;BR /&gt;
remark #15456: masked unaligned unit stride loads: 16&lt;/P&gt;

&lt;P style="margin-bottom: 0in; line-height: 100%"&gt;Changed code&lt;/P&gt;

&lt;P&gt;remark #15448: unmasked aligned unit stride loads: 1&lt;BR /&gt;
remark #15454: masked aligned unit stride loads: 2&lt;BR /&gt;
remark #15455: masked aligned unit stride stores: 6&lt;BR /&gt;
remark #15456: masked unaligned unit stride loads: 16&lt;/P&gt;

</description>
      <pubDate>Tue, 28 Jul 2015 06:02:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Unexplained-speedup-how/m-p/1028695#M6650</guid>
      <dc:creator>aketh_t_</dc:creator>
      <dc:date>2015-07-28T06:02:28Z</dc:date>
    </item>
    <item>
      <title>It does look like the fusion</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Unexplained-speedup-how/m-p/1028696#M6651</link>
      <description>&lt;P&gt;It does look like the fusion of your WHERE version is incomplete, possibly as a consequence of the rank-2 array assignments, or possibly because that style is less common in the performance-critical code that the compiler has been tuned to optimize. Keep in mind that the semantics of WHERE require the compiler to start by distributing the conditional to each individual assignment, so it has a lot more work to do to get back to full sharing of operands through loop fusion.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Jul 2015 12:15:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Unexplained-speedup-how/m-p/1028696#M6651</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-07-28T12:15:00Z</dc:date>
    </item>
    <item>
      <title>What I suspect to be</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Unexplained-speedup-how/m-p/1028697#M6652</link>
      <description>&lt;P&gt;What I suspect is happening is that, in the 4x test case, the array LMASK is sparsely (or at least not densely) populated with .true. values.&lt;/P&gt;

&lt;P&gt;Under this circumstance, the code may have been performing masked load/store operations where the entire mask is .false. (so no element is actually loaded or stored).&lt;/P&gt;
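&lt;P&gt;One way to test this hypothesis is to measure how densely LMASK is populated and how many vector-sized chunks of it are entirely .false.. The following is only a minimal sketch: the block sizes, the chunk width vlen, and the random mask standing in for the real LMASK are all assumptions made for illustration; in the real code you would run the same counting on the actual mask.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program mask_density
  implicit none
  ! Hypothetical sizes; vlen stands in for the number of elements per vector.
  integer, parameter :: nx_block = 64, ny_block = 64, vlen = 8
  logical :: LMASK(nx_block, ny_block)
  real :: r(nx_block, ny_block)
  integer :: i, j, nchunks, nempty

  ! Stand-in mask (assumption): about 5 percent of the entries are .true.
  call random_number(r)
  LMASK = r .gt. 0.95

  ! Overall density of the mask.
  print *, 'fraction .true.:', real(count(LMASK)) / real(size(LMASK))

  ! Count contiguous chunks of vlen elements along the first (unit-stride)
  ! dimension in which every mask value is .false.
  nchunks = 0
  nempty = 0
  do j = 1, ny_block
    do i = 1, nx_block, vlen
      nchunks = nchunks + 1
      if (.not. any(LMASK(i:min(i + vlen - 1, nx_block), j))) nempty = nempty + 1
    end do
  end do
  print *, 'chunks with all-.false. mask:', nempty, 'of', nchunks
end program mask_density&lt;/PRE&gt;

&lt;P&gt;A high share of all-.false. chunks would be consistent with the suspicion above, though it would not by itself prove where the time goes.&lt;/P&gt;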

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jul 2015 14:47:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Unexplained-speedup-how/m-p/1028697#M6652</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2015-07-29T14:47:50Z</dc:date>
    </item>
  </channel>
</rss>

