<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic If you believe there's a in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992337#M3385</link>
    <description>If you believe there's a difference in compilation, comparing the output of the opt-report-file option for the two builds should verify it.  If you don't want to give a working example, you could at least quote the differences in the reports, in case they mean more to us than to you.
Again, you'd get more expert opinions on the relevant Fortran forum, in case that's what you are interested in.</description>
    <pubDate>Mon, 17 Sep 2012 19:33:42 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2012-09-17T19:33:42Z</dc:date>
    <item>
      <title>Global Variables with ifort</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992334#M3382</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I'm using the Intel Fortran compiler and I'm seeing performance problems when I use global variables (arrays). Basically, I have a four-dimensional array and a loop that processes all of its elements. When I pass the array to the subroutine as a parameter, I get a certain execution time. When I use it directly inside the routine as a global variable, the execution time doubles. I'm guessing the compiler disables some optimizations when I use a global variable. Is that the case? If so, how can I enable them again? If not, does anyone have any idea what causes this slowdown?&lt;/P&gt;
&lt;P&gt;The arrays are declared in a global module like this:&lt;/P&gt;
&lt;P&gt;real, allocatable, target :: ux(:,:,:,:), uy(:,:,:,:)&lt;/P&gt;
&lt;P&gt;And allocated with:&lt;/P&gt;
&lt;P&gt;allocate(ux(nmin1-4:nmax1+4+u_pad, nmin2-4:nmax2+4,&amp;nbsp;nmin3-4:nmax3+4,-1:3))&lt;/P&gt;
&lt;P&gt;Any idea what is happening?&lt;/P&gt;
</description>
      <pubDate>Mon, 17 Sep 2012 13:45:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992334#M3382</guid>
      <dc:creator>Rafael_Silva</dc:creator>
      <dc:date>2012-09-17T13:45:11Z</dc:date>
    </item>
    <item>
      <title>If you've read the ifort</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992335#M3383</link>
      <description>If you've read the ifort documentation but are still having difficulty understanding the compiler's optimization reports, you could follow up with an actual code sample on the Fortran forum appropriate to your platform (Windows, or Linux and Mac).
Your description leaves too much up to the imagination.</description>
      <pubDate>Mon, 17 Sep 2012 14:18:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992335#M3383</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-09-17T14:18:33Z</dc:date>
    </item>
    <item>
      <title>I've read the documentation</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992336#M3384</link>
      <description>I've read the documentation and didn't find anything specific to globals. Since this is related to optimization, I thought this was the correct place to post. I can't see what is missing from the description. The central point is: I changed an array from a parameter to a global and it slowed down the code, doubling the execution time (same code, same allocation, same types, same names). I would like to know what is missing from the description...</description>
      <pubDate>Mon, 17 Sep 2012 14:32:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992336#M3384</guid>
      <dc:creator>Rafael_Silva</dc:creator>
      <dc:date>2012-09-17T14:32:10Z</dc:date>
    </item>
    <item>
      <title>If you believe there's a</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992337#M3385</link>
      <description>If you believe there's a difference in compilation, comparing the output of the opt-report-file option for the two builds should verify it.  If you don't want to give a working example, you could at least quote the differences in the reports, in case they mean more to us than to you.
Again, you'd get more expert opinions on the relevant Fortran forum, in case that's what you are interested in.</description>
      <pubDate>Mon, 17 Sep 2012 19:33:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992337#M3385</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-09-17T19:33:42Z</dc:date>
    </item>
    <item>
      <title>I've posted on Intel Fortran</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992338#M3386</link>
      <description>I posted on the Intel Fortran Compiler forum but got no answer. If you think it is more appropriate to move this topic there, please do so (or ask someone who can).

This is the loop:

        do k=nmin3-4,nmax3+4
          do j=nmin2-4,nmax2+4
            do i=nmin1-4,nmax1+4
              ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k, 1) - 4.*ux(i,j,k,0) +    ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt

              uy(i,j,k,3) = (20.*uy(i,j,k,2) - 6.*uy(i,j,k, 1) - 4.*uy(i,j,k,0) +    uy(i,j,k,-1) + 12.*uy(i,j,k,3)*dt2)*ctt

              uz(i,j,k,3) = (20.*uz(i,j,k,2) - 6.*uz(i,j,k, 1) - 4.*uz(i,j,k,0) +    uz(i,j,k,-1) + 12.*uz(i,j,k,3)*dt2)*ctt
            end do
          end do
        end do

This is the declaration:
        real, allocatable, target :: ux(:,:,:,:), uy(:,:,:,:)
        real, allocatable, target :: uz(:,:,:,:)

This is the allocation:
         allocate(ux(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))
         allocate(uy(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))
         allocate(uz(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))

Case 1:
     
        They are declared and allocated inside a subroutine called Source. This subroutine calls another one, called
Update, passing ux, uy, uz as parameters, which are "defined" by the Update subroutine like this:

        real :: ux(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)
        real :: uy(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)
        real :: uz(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)

        In fact, this is the same definition as in the caller subroutine, Source.

Case 2:
  
        They are still allocated by the Source subroutine, but declared as globals in
another module accessible to both the Source and Update subroutines. So the Update
subroutine just accesses them as global variables, with no need to declare them or pass them as
parameters.

My problem is: case 2 is two times slower than case 1.
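
The two cases can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual application code: the module, routine, and variable names (field_globals, update_kernels, update_args, update_globals, lo, hi, copy) and the values of dt2 and ctt are all hypothetical, and only ux is shown.

```fortran
! Sketch only: module, routine, and variable names are hypothetical.
module field_globals
  implicit none
  ! Case 2 storage: module-level allocatable array with the TARGET attribute.
  real, allocatable, target :: ux(:,:,:,:)
end module field_globals

module update_kernels
  implicit none
contains

  ! Case 1: the array arrives as an explicit-shape dummy argument.
  subroutine update_args(u, lo1, hi1, lo2, hi2, lo3, hi3, dt2, ctt)
    integer, intent(in) :: lo1, hi1, lo2, hi2, lo3, hi3
    real, intent(inout) :: u(lo1:hi1, lo2:hi2, lo3:hi3, -1:3)
    real, intent(in) :: dt2, ctt
    integer :: i, j, k
    do k = lo3, hi3
      do j = lo2, hi2
        do i = lo1, hi1
          u(i,j,k,3) = (20.*u(i,j,k,2) - 6.*u(i,j,k,1) - 4.*u(i,j,k,0) + u(i,j,k,-1) + 12.*u(i,j,k,3)*dt2)*ctt
        end do
      end do
    end do
  end subroutine update_args

  ! Case 2: the same loop, but the array is reached through the module.
  subroutine update_globals(dt2, ctt)
    use field_globals, only: ux
    real, intent(in) :: dt2, ctt
    integer :: i, j, k
    do k = lbound(ux, 3), ubound(ux, 3)
      do j = lbound(ux, 2), ubound(ux, 2)
        do i = lbound(ux, 1), ubound(ux, 1)
          ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k,1) - 4.*ux(i,j,k,0) + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt
        end do
      end do
    end do
  end subroutine update_globals

end module update_kernels

program compare_cases
  use field_globals, only: ux
  use update_kernels, only: update_args, update_globals
  implicit none
  integer, parameter :: lo = -3, hi = 12      ! small hypothetical bounds
  real, parameter :: dt2 = 0.01, ctt = 0.5    ! hypothetical coefficients
  real, allocatable :: copy(:,:,:,:)

  allocate(ux(lo:hi, lo:hi, lo:hi, -1:3))
  allocate(copy(lo:hi, lo:hi, lo:hi, -1:3))
  call random_number(ux)
  copy = ux

  call update_args(copy, lo, hi, lo, hi, lo, hi, dt2, ctt)   ! case 1
  call update_globals(dt2, ctt)                              ! case 2

  ! Both variants do the same arithmetic, so any difference should be at rounding level.
  print *, 'max abs difference:', maxval(abs(copy - ux))
end program compare_cases
```

Timing each call separately on realistic array sizes, and comparing the compiler reports for the two routines, would be one way to reproduce the comparison described here.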

In case 1, the line numbers are:
19:        do k=nmin3-4,nmax3+4
20:          do j=nmin2-4,nmax2+4
21:            do i=nmin1-4,nmax1+4


The HLO report for case 1 on these lines is:

LOOP DISTRIBUTION in update_3d_ at line 21
LOOP DISTRIBUTION in update_3d_ at line 21
LOOP DISTRIBUTION in update_3d_ at line 21


Loop Interchange not done due to: Original Order seems proper
Advice: Loop Interchange might help Loopnest at lines: 19 20 21
      : Original Order found to be proper, but by a close margin


In case 2, the line numbers are:
12:        do k=nmin3-4,nmax3+4
13:          do j=nmin2-4,nmax2+4
14:            do i=nmin1-4,nmax1+4

and the report says:

LOOP DISTRIBUTION in update_3d_ at line 14


Loop Interchange not done due to: Original Order seems proper
Advice: Loop Interchange might help Loopnest at lines: 12 13 14
      : Original Order found to be proper, but by a close margin


The only difference I can see is that the first case has two more
LOOP DISTRIBUTION lines. Any idea what is causing this, and why?</description>
      <pubDate>Mon, 17 Sep 2012 23:24:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992338#M3386</guid>
      <dc:creator>Rafael_Silva</dc:creator>
      <dc:date>2012-09-17T23:24:50Z</dc:date>
    </item>
    <item>
      <title>Are you saying that neither</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992339#M3387</link>
      <description>Are you saying that neither case (or maybe both) show full vectorization, so you don't see a difference there?  The compiler is more likely to perform distribution on vectorized loops, but I certainly wouldn't like to rely on that as the only indicator.</description>
      <pubDate>Tue, 18 Sep 2012 00:12:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992339#M3387</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-09-18T00:12:43Z</dc:date>
    </item>
    <item>
      <title>You're right, I was not</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992340#M3388</link>
      <description>You're right, I was not enabling the full report.
The faster version (case 1) says:

(629:13-629:13):VEC:sourcewave_:  PARTIAL LOOP WAS VECTORIZED

The slower version (case 2) says:

HPO Vectorizer Report (update_3d_)

src/Update.f90(14): (col. 13) remark: loop was not vectorized: existence of vector dependence.
src/Update.f90(21): (col. 15) remark: vector dependence: assumed ANTI dependence between (unknown) line 21 and (unknown) line 15.
src/Update.f90(15): (col. 15) remark: vector dependence: assumed FLOW dependence between (unknown) line 15 and (unknown) line 21.
src/Update.f90(20): (col. 15) remark: vector dependence: assumed ANTI dependence between (unknown) line 20 and (unknown) line 15.
(...)


But why? Considering it is the same code, why does the compiler see a dependence in one version and not in the other?</description>
      <pubDate>Tue, 18 Sep 2012 20:47:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992340#M3388</guid>
      <dc:creator>Rafael_Silva</dc:creator>
      <dc:date>2012-09-18T20:47:38Z</dc:date>
    </item>
    <item>
      <title>Forcing the vectorization,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992341#M3389</link>
      <description>Forcing vectorization, using the directive that says there is no loop-carried dependence, makes the execution time almost the same for both cases.
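
A minimal sketch of what forcing vectorization on the global-array version could look like, assuming the directive in question is Intel's `!DIR$ IVDEP` placed on the innermost loop. The module and program names (wave_fields, ivdep_demo), the array bounds, and the coefficient values are hypothetical, and only ux is shown.

```fortran
! Sketch only: names, bounds, and coefficients are hypothetical.
module wave_fields
  implicit none
  real, allocatable, target :: ux(:,:,:,:)
contains

  subroutine update_3d(dt2, ctt)
    real, intent(in) :: dt2, ctt
    integer :: i, j, k
    do k = lbound(ux, 3), ubound(ux, 3)
      do j = lbound(ux, 2), ubound(ux, 2)
        ! Assumption: IVDEP asks ifort to ignore assumed (unproven) dependences
        ! in the loop that follows. Compilers that do not recognize the
        ! directive treat the line as an ordinary comment.
        !DIR$ IVDEP
        do i = lbound(ux, 1), ubound(ux, 1)
          ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k,1) - 4.*ux(i,j,k,0) + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt
        end do
      end do
    end do
  end subroutine update_3d

end module wave_fields

program ivdep_demo
  use wave_fields, only: ux, update_3d
  implicit none
  allocate(ux(-3:12, -3:8, -3:8, -1:3))
  ux = 1.0
  call update_3d(0.01, 0.5)
  print *, 'ux(0,0,0,3) after one update:', ux(0,0,0,3)
end program ivdep_demo
```

The directive is only safe when the programmer knows the iterations are independent, as they are here: each iteration reads and writes only its own (i,j,k) element.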

But I still can't understand why the compiler considers one case to have dependences and the other not. The code is the same... I just changed the declarations from local to global.</description>
      <pubDate>Wed, 19 Sep 2012 12:01:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Global-Variables-with-ifort/m-p/992341#M3389</guid>
      <dc:creator>Rafael_Silva</dc:creator>
      <dc:date>2012-09-19T12:01:23Z</dc:date>
    </item>
  </channel>
</rss>

