<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968866#M96534</link>
    <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;JD,&lt;/P&gt;

&lt;P&gt;The i subscript was intentionally put last because the problem&lt;/P&gt;

&lt;P&gt;is decoupled with respect to this index for quite a lot of the code.&lt;/P&gt;

&lt;P&gt;So there are a number of places elsewhere where the omp loop is over&lt;/P&gt;

&lt;P&gt;the last index i. This makes the whole calculation quite efficient elsewhere&lt;/P&gt;

&lt;P&gt;in the code and thus this bit now stands out. I may try some copying into&lt;/P&gt;

&lt;P&gt;temporary work arrays to see if things improve.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I tried the -x option of ifort with SSE4.1 and AVX with no perceptible&lt;/P&gt;

&lt;P&gt;change in the CPU timings.&lt;/P&gt;

&lt;P&gt;Thanks.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 02 Apr 2014 21:05:09 GMT</pubDate>
    <dc:creator>a_b_1</dc:creator>
    <dc:date>2014-04-02T21:05:09Z</dc:date>
    <item>
      <title>Can this be made better?</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968862#M96530</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I wonder if anyone has the time and inclination to have a look at the code below for&lt;/P&gt;

&lt;P&gt;any possible improvements.&lt;/P&gt;

&lt;P&gt;The extract included here is the heaviest user of CPU time in a large-ish simulation code.&lt;/P&gt;

&lt;P&gt;A typical run would take 6-9 months of running 24/7 with six threads on six cores.&lt;/P&gt;

&lt;P&gt;The omp part is working very well and there cannot be much improvement in the multithreading part.&lt;/P&gt;

&lt;P&gt;The compiler call used for the whole code is:&lt;/P&gt;

&lt;P&gt;ifort -O3 -r8&amp;nbsp; -openmp -fpp -parallel -mcmodel=medium -i-dynamic -shared-intel&lt;/P&gt;

&lt;P&gt;Would there be a benefit if part or all of it were written in assembler?&lt;/P&gt;

&lt;P&gt;Lots and lots of thanks for any suggestions.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;! typical values

! N1 = 768
! N2 = N3 = 12
! M3 = M2 = 42

  Do KEL = 1, N3
  Do JEL = 1, N2

...
[address calculations]

!$OMP  PARALLEL DEFAULT(SHARED) PRIVATE( I, J, K, JA, KA, JJ, KK )
!$OMP DO
   Do K = 1, M3
      Do J = 1, M2
         JJ = (J-1)*NX32

! - copy into work arrays for later fft.

         Do I = 1, N1
            WK_1( JJ+I, K ) = U( J_Jump+J, K_Jump+K, I )
            WK_2( JJ+I, K ) = V( J_Jump+J, K_Jump+K, I )
            WK_3( JJ+I, K ) = W( J_Jump+J, K_Jump+K, I )
         End Do

         Do I = 1, N1, 2
! - du/dx
            WKX_1( JJ+I,   K ) = -Wv(i)*U( J_Jump+J, K_Jump+K, I+1 )
            WKX_1( JJ+I+1, K ) =  Wv(i)*U( J_Jump+J, K_Jump+K, I   )
! - dv/dx
            WKX_2( JJ+I,   K ) = -Wv(i)*V( J_Jump+J, K_Jump+K, I+1 )
            WKX_2( JJ+I+1, K ) =  Wv(i)*V( J_Jump+J, K_Jump+K, I   )
! - dw/dx
            WKX_3( JJ+I,   K ) = -Wv(i)*W( J_Jump+J, K_Jump+K, I+1 )
            WKX_3( JJ+I+1, K ) =  Wv(i)*W( J_Jump+J, K_Jump+K, I   )

         End Do

! - Y derivatives.

         Do JA =  1, M2
            Do I = 1, N1
               WK_4( JJ+I, K ) = WK_4( JJ+I, K ) + RDY*DYGL(J,JA)*U( J_jump+JA, K_jump+K, I )  ! du/dy
               WK_5( JJ+I, K ) = WK_5( JJ+I, K ) + RDY*DYGL(J,JA)*V( J_jump+JA, K_jump+K, I )  ! dv/dy
               WK_6( JJ+I, K ) = WK_6( JJ+I, K ) + RDY*DYGL(J,JA)*W( J_jump+JA, K_jump+K, I )  ! dw/dy
            End Do
         End Do

! - Z derivatives.

         Do KA = 1, M3
            Do I = 1, N1
               WK_7( JJ+I, K ) = WK_7( JJ+I, K ) + RDZ*DZGL(K,KA)*U( J_jump+J, K_jump+KA, I )   ! du/dz
               WK_8( JJ+I, K ) = WK_8( JJ+I, K ) + RDZ*DZGL(K,KA)*V( J_jump+J, K_jump+KA, I )   ! dv/dz
               WK_9( JJ+I, K ) = WK_9( JJ+I, K ) + RDZ*DZGL(K,KA)*W( J_jump+J, K_jump+KA, I )   ! dw/dz
            End Do
         End Do

      End Do
   End Do   ! eo single element loop.
!$OMP END DO
!$OMP END PARALLEL

...
[other stuff]

end do
end do
&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 07:43:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968862#M96530</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2014-04-02T07:43:08Z</dc:date>
    </item>
    <item>
      <title>I suppose you would require</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968863#M96531</link>
      <description>&lt;P&gt;I suppose you would require selection of sse4.1 or avx to see vectorization in the opt-report. If the compiler doesn't accept unroll-and-jam directives at -O3 you may need to write the unrolling explicitly and use an event profiler to see if you are alleviating poor cache behavior.&lt;/P&gt;

&lt;P&gt;My best results with unroll and jam came with specified number:&lt;/P&gt;

&lt;P&gt;!dir$ unroll_and_jam = 2&lt;/P&gt;

&lt;P&gt;You would put the directive ahead of the JA and KA loops. &amp;nbsp;This should replace the unrolling which the compiler normally does on the inner loop only (which could be eliminated by putting !dir$ unroll(0) there). &amp;nbsp;The compiler would tell you in the opt-report what unrolling was actually chosen.&lt;/P&gt;
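
&lt;P&gt;For example, placed ahead of the JA loop from the posted code (a sketch reusing that post's own names and bounds):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!dir$ unroll_and_jam = 2
         Do JA = 1, M2
            Do I = 1, N1
               WK_4( JJ+I, K ) = WK_4( JJ+I, K ) + RDY*DYGL(J,JA)*U( J_jump+JA, K_jump+K, I )
            End Do
         End Do
&lt;/PRE&gt;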

&lt;P&gt;If 2 works, you would need to test whether a larger number (including whatever the compiler might choose) is better, using a cut-down version of your application. &amp;nbsp;For VTune profiling (with one of the L1 and L2 cache analysis choices), you probably need a case which runs for just a few minutes without enabling multiple runs.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 08:06:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968863#M96531</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-04-02T08:06:00Z</dc:date>
    </item>
    <item>
      <title>Thank you.</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968864#M96532</link>
      <description>&lt;P&gt;Thank you.&lt;/P&gt;

&lt;P&gt;With or without the unroll_and_jam directive the -opt-report gives nothing for&lt;/P&gt;

&lt;P&gt;the lines concerned.&lt;/P&gt;

&lt;P&gt;Anyway, I found that&lt;/P&gt;

&lt;P&gt;!dir$ unroll_and_jam = 4&lt;/P&gt;

&lt;P&gt;produces about a 4-5% reduction in CPU time for the whole code. Not a bad start.&lt;/P&gt;

&lt;P&gt;I will follow the rest of your suggestions and see what comes up.&lt;/P&gt;

&lt;P&gt;I am also trying to switch to FFTW, with weird things happening, but that&lt;/P&gt;

&lt;P&gt;is a matter for another forum - I know.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 13:45:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968864#M96532</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2014-04-02T13:45:37Z</dc:date>
    </item>
    <item>
      <title>Can you reorganize your U,V,W</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968865#M96533</link>
      <description>&lt;P&gt;Can you reorganize your U,V,W arrays such that the last index is rearranged to be first?&lt;/P&gt;

&lt;P&gt;Alternately, if the U,V,W arrays are relatively fixed through long compute sections, then consider shadowing U,V,W with Uotherway, Votherway, Wotherway that have the I index first (you can pick a different naming for "otherway"). Only update the shadows when necessary.&lt;/P&gt;
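
&lt;P&gt;A minimal sketch of the shadowing (Uotherway and the NJ, NK bounds here are only illustrative, assuming U is dimensioned U(NJ,NK,N1)):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;! refresh the I-first shadow only when U has changed
Do K = 1, NK
   Do J = 1, NJ
      Do I = 1, N1
         Uotherway( I, J, K ) = U( J, K, I )
      End Do
   End Do
End Do
! hot loops then read Uotherway with unit stride in I
&lt;/PRE&gt;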

&lt;P&gt;Making the above change will improve vectorization opportunities (at the expense of making the copy).&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;
      <pubDate>Wed, 02 Apr 2014 17:42:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968865#M96533</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-04-02T17:42:54Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968866#M96534</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;JD,&lt;/P&gt;

&lt;P&gt;The i subscript was intentionally put last because the problem&lt;/P&gt;

&lt;P&gt;is decoupled with respect to this index for quite a lot of the code.&lt;/P&gt;

&lt;P&gt;So there are a number of places elsewhere where the omp loop is over&lt;/P&gt;

&lt;P&gt;the last index i. This makes the whole calculation quite efficient elsewhere&lt;/P&gt;

&lt;P&gt;in the code and thus this bit now stands out. I may try some copying into&lt;/P&gt;

&lt;P&gt;temporary work arrays to see if things improve.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I tried the -x option of ifort with SSE4.1 and AVX with no perceptible&lt;/P&gt;

&lt;P&gt;change in the CPU timings.&lt;/P&gt;

&lt;P&gt;Thanks.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 21:05:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968866#M96534</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2014-04-02T21:05:09Z</dc:date>
    </item>
    <item>
      <title>If -xsse4.1 doesn't trigger</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968867#M96535</link>
      <description>If -xsse4.1 doesn't trigger vectorization you might try !dir$ simd on those loops.  It means the compiler didn't rate vectorization as useful, likely due to the large-stride operand, but you may as well try it.</description>
      <pubDate>Wed, 02 Apr 2014 22:25:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968867#M96535</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-04-02T22:25:25Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968868#M96536</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I have tested the JD suggestion of carrying two copies of the big arrays&lt;/P&gt;

&lt;P&gt;with reverse indexing, i.e. I now have both uri( i, j, k ) = u( j, k, i ) etc.&lt;/P&gt;

&lt;P&gt;On an old Q9650 (4 cores, 4 threads) the overall reduction in CPU usage is&lt;STRONG&gt; 8%&lt;/STRONG&gt;.&lt;/P&gt;

&lt;P&gt;On the newer i7-3090k (6 cores &amp;amp; 6 threads) overall code CPU usage dropped&lt;/P&gt;

&lt;P&gt;by &lt;STRONG&gt;33%&lt;/STRONG&gt; (no kidding).&lt;/P&gt;

&lt;P&gt;Fiddling with the !dir$ unroll_and_jam directive gives a consistent &lt;STRONG&gt;4%&lt;/STRONG&gt; reduction in CPU time.&lt;/P&gt;

&lt;P&gt;These are quite good results, with an increase of 33% in the usage of&lt;/P&gt;

&lt;P&gt;RAM (+1.4GB for my current case).&lt;/P&gt;

&lt;P&gt;There are a few more things to try.&lt;/P&gt;

&lt;P&gt;Now, what stands out as the most CPU-intensive part is the FFTs.&lt;/P&gt;

&lt;P&gt;I am trying to switch to FFTW but so far this has completely&lt;/P&gt;

&lt;P&gt;messed up the whole code.&lt;/P&gt;

&lt;P&gt;Thank you.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2014 10:53:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968868#M96536</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2014-04-03T10:53:00Z</dc:date>
    </item>
    <item>
      <title>33% improvement is a good</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968869#M96537</link>
      <description>&lt;P&gt;33% improvement is&amp;nbsp;a good return on a little bit of time invested in making this change.&lt;/P&gt;

&lt;P&gt;In some cases you will want to organize your data, paying&amp;nbsp;particular attention to the dimension index with respect to vectorization. IOW, which order will improve performance by improving vectorization.&lt;/P&gt;

&lt;P&gt;In other cases, part of the code may benefit from one dimension order while a different part of the code benefits from a different order. When the order does not flip frequently (one order is used multiple times in a row before the other), I incorporate a flag indicating which array has the most recent representation of the data (0 = a same as b, 1 = a more recent than b, -1 = b more recent than a). When the most recent copy is not the one I wish to use, a copy is made and the flag is set to 0 if the procedure is read-only, or to 1/-1 for modification, depending on which array is being modified. This cuts down on unnecessary copies/transformations.&lt;/P&gt;
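
&lt;P&gt;A sketch of that bookkeeping (the names and the copy routine here are illustrative only):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;integer :: fresh   ! 0 = a same as b, 1 = a more recent, -1 = b more recent

! in a section that wants to work on b:
if (fresh == 1) then
   call copy_transposed(a, b)   ! hypothetical copy/transform a -&gt; b
   fresh = 0
end if
! ... use b; if this section modifies b, set fresh = -1 afterwards ...
&lt;/PRE&gt;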

&lt;P&gt;FWIW I also used this technique when an array could reside in CPU memory or GPU memory (or both).&lt;/P&gt;

&lt;P&gt;Considering that SIMD vectors are getting wider and wider, more care is required in designing data layouts. Discount scatter/gather, as that will only reduce instruction count and not store/fetch cycles. Scatter and gather are beneficial for infrequent accesses of those data vectors, with respect to being used in combination with other non-gathered vectors.&lt;/P&gt;

&lt;P&gt;You will tend to find&amp;nbsp;that most of the compiler's gripes about poor vectorization are solvable by reorganization of the data.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2014 15:01:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968869#M96537</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-04-03T15:01:00Z</dc:date>
    </item>
  </channel>
</rss>

