<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Optimisation of Do Loops in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimisation-of-Do-Loops/m-p/823623#M49000</link>
    <description>I have been looking at how old-style DO loops respond to optimisation, and have found one case that performs the opposite of what is expected.&lt;BR /&gt;I have considered 6 options for calculating a Dot_Product, using ifort ver 11.1:&lt;BR /&gt;&lt;BR /&gt;1) Including a conventional DO loop in a block of code.&lt;BR /&gt;[bash]            c = 0 
            do k = JEQ_bot,J-1 
               c = c + A(A_ptr_0+k) * B(B_ptr_0+k) 
            end do 
            A(A_ptr_0+J) = A(A_ptr_0+J) - c 
[/bash]&lt;P&gt;&lt;BR /&gt;2) Converting the DO Loop to Dot_Product, using array sections.&lt;/P&gt;[bash]            A_ptr_t = A_ptr_b + JBAND - 1 
            B_ptr_t = B_ptr_b + JBAND - 1 
            A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product (A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) ) 
[/bash]&lt;P&gt;&lt;BR /&gt;3) Wrapping the DO loop into an F77-style function&lt;/P&gt;[bash]            A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum   (A(A_ptr_b), B(B_ptr_b), JBAND) 
where
      REAL*8 FUNCTION VEC_SUM (A, B, N) 
! 
      integer*4,               intent (in) :: n 
      real*8,    dimension(n), intent (in) :: a 
      real*8,    dimension(n), intent (in) :: b 
!
      real*8    c 
      integer*4 k 
! 
      c = 0 
      do k = 1,N 
         c = c + A(k) * B(k) 
      end do 
      vec_sum = c 
! 
      RETURN 
! 
      END 
[/bash]&lt;P&gt;&lt;BR /&gt;4) Wrapping the Dot_Product into an F77-style function, to avoid array sections.&lt;/P&gt;[bash]            A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum_d (A(A_ptr_b), B(B_ptr_b), JBAND) 
where
      REAL*8 FUNCTION VEC_SUM_d (A, B, N) 
! 
      integer*4,               intent (in) :: n 
      real*8,    dimension(n), intent (in) :: a 
      real*8,    dimension(n), intent (in) :: b 
! 
      vec_sum_d = dot_product ( a, b ) 
! 
      RETURN 
! 
      END 
[/bash]&lt;BR /&gt;5) Modifying the DO loop to hold the subscripts in temporary variables&lt;BR /&gt;[bash]            c = 0 
            do k = JEQ_bot,J-1 
               ia = A_ptr_0+k 
               ib = B_ptr_0+k 
               c = c + A(ia) * B(ib) 
            end do 
            A(A_ptr_0+J) = A(A_ptr_0+J) - c 
[/bash]&lt;BR /&gt;6) Modifying the DO loop so that the temporary subscripts auto-increment&lt;BR /&gt;[bash]            c = 0 
            do k = JEQ_bot,J-1 
               c = c + A(A_ptr_b) * B(B_ptr_b) 
               A_ptr_b = A_ptr_b+1 
               B_ptr_b = B_ptr_b+1 
            end do 
            A(A_ptr_b) = A(A_ptr_b) - c 
[/bash]&lt;BR /&gt;A and B are real*8 vectors and all subscripts are integer*4.&lt;BR /&gt;&lt;BR /&gt;I have tested these on a Xeon processor with compiler options:&lt;BR /&gt; /O1, /O2 or /O3 (/O2 is default)&lt;BR /&gt; /Qvec or /Qvec- (/Qvec is default)&lt;BR /&gt; /QxHost&lt;BR /&gt;&lt;BR /&gt;All coding options EXCEPT option 3 show an improvement going from (/O1 or /O2 /Qvec-) to /O2, with /O3 similar: the run time reduces from 13 seconds to 8 seconds, presumably benefiting from the vector instructions. /QxHost does not have any significant effect on these results.&lt;BR /&gt;&lt;BR /&gt;However, for option 3, /O1 or (/O2 /Qvec- /QxHost) all took 13 seconds, but if /QxHost and (/O2 or /O3) were combined, with the /Qvec default, the run time blew out to 22 seconds. /O3 without /QxHost reduced it slightly, to 11.5 seconds. This is very different from all the other options.&lt;BR /&gt;&lt;BR /&gt;I was expecting /QxHost to select the preferred code generation for the installed processor, but in case 3 it appears to fail.&lt;BR /&gt;&lt;BR /&gt;Option 3 is a coding approach I have used in a lot of old F77-style codes, where the wrapper is one of a set of common calculations stored in a library of common routines. They are written as simple routines that are expected to benefit from optimisation at a local level. This approach suits a number of other compilers.&lt;BR /&gt;&lt;BR /&gt;From this I conclude that for ifort:&lt;BR /&gt;/QxHost should not be used, and&lt;BR /&gt;I should review my use of libraries of common calculations.&lt;BR /&gt;&lt;BR /&gt;Why would the combination of /QxHost with /O2 or /O3 cause such a contrary result in coding example 3?&lt;BR /&gt;&lt;BR /&gt;John</description>
    <pubDate>Mon, 14 May 2012 05:32:19 GMT</pubDate>
    <dc:creator>John_Campbell</dc:creator>
    <dc:date>2012-05-14T05:32:19Z</dc:date>
    <item>
      <title>Optimisation of Do Loops</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimisation-of-Do-Loops/m-p/823623#M49000</link>
      <pubDate>Mon, 14 May 2012 05:32:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimisation-of-Do-Loops/m-p/823623#M49000</guid>
      <dc:creator>John_Campbell</dc:creator>
      <dc:date>2012-05-14T05:32:19Z</dc:date>
    </item>
    <item>
      <title>Optimisation of Do Loops</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimisation-of-Do-Loops/m-p/823624#M49001</link>
      <description>What make and exact model CPU are you using? I would not expect /QxHost to make things worse.&lt;BR /&gt;&lt;BR /&gt;Your option 3 means that the compiler has no idea how long the loops are nor whether the arguments are aligned. Otherwise it looks ok. But I would suggest you look at the BLAS dot-product routines in MKL if performance is important.</description>
      <pubDate>Mon, 14 May 2012 15:18:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimisation-of-Do-Loops/m-p/823624#M49001</guid>
      <dc:creator>Steven_L_Intel1</dc:creator>
      <dc:date>2012-05-14T15:18:41Z</dc:date>
    </item>
    <item>
      <title>Optimisation of Do Loops</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimisation-of-Do-Loops/m-p/823625#M49002</link>
      <description>I have now updated ifort from Ver 11.1 to Ver 12.1.5.344, and have also upgraded the Xeon processor and moved to Win 7_64.&lt;BR /&gt;The problem I found with /QxHost has now gone, so I am relieved that my old approach of using libraries of simple procedures can still be used. (The problem still occurred with Ver 11.1 on the upgraded PC.)&lt;BR /&gt;&lt;BR /&gt;I now need to go back and see what other problems I was having.&lt;BR /&gt;&lt;BR /&gt;The aim of this review (and of selecting ifort) has been to identify how to parallelize a skyline direct solver for large sets of linear equations.&lt;BR /&gt;I have found that vectorising was easy to use, but I struggled with the advice I received last year on parallelizing my code. While I am very experienced in coding to F77 and F95, a consequence of this experience is that my age makes it more difficult to learn new techniques!&lt;BR /&gt;&lt;BR /&gt;If you could recommend sections of the ifort documentation that I should read first to better understand how to approach parallelizing, it would be appreciated.&lt;BR /&gt;&lt;BR /&gt;John</description>
      <pubDate>Wed, 22 Aug 2012 06:26:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimisation-of-Do-Loops/m-p/823625#M49002</guid>
      <dc:creator>John_Campbell</dc:creator>
      <dc:date>2012-08-22T06:26:15Z</dc:date>
    </item>
    <item>
      <title>Optimisation of Do Loops</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimisation-of-Do-Loops/m-p/823626#M49003</link>
      <description>I suggest that you do a build with Guided Auto Parallelization/Vectorization (GAP) enabled to see what the compiler has to say about what you might do differently. This is a build using /Qguide; it does not create an executable, but can output diagnostics with recommendations.</description>
      <pubDate>Wed, 22 Aug 2012 15:09:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimisation-of-Do-Loops/m-p/823626#M49003</guid>
      <dc:creator>Steven_L_Intel1</dc:creator>
      <dc:date>2012-08-22T15:09:13Z</dc:date>
    </item>
  </channel>
</rss>

