<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hotspot problem (found with vtune): do loop for initialising ve in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753026#M8872</link>
    <description>&lt;DIV id="tiny_quote"&gt;
                &lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=491837" class="basic" href="https://community.intel.com/en-us/profile/491837/"&gt;Guillaume De Nayer&lt;/A&gt;&lt;/DIV&gt;
                &lt;DIV style="background-color: #e5e5e5; padding: 5px; border: 1px inset; margin-left: 2px; margin-right: 2px;"&gt;&lt;I&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Generally, if I have a loop like this:&lt;BR /&gt;do k=1,nkm&lt;BR /&gt; do i=1,nim&lt;BR /&gt; do j=1,njm&lt;BR /&gt; inp=ha(i,j,k)&lt;BR /&gt; be(inp)=pp(inp)*(1.-fx(inp))+pp(inp+idew)*fx(inp)&lt;BR /&gt; bn(inp)=pp(inp)*(1.-fy(inp))+pp(inp+idns)*fy(inp)&lt;BR /&gt; bt(inp)=pp(inp)*(1.-fz(inp))+pp(inp+idtb)*fz(inp)&lt;BR /&gt; enddo&lt;BR /&gt; enddo&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;Is it better to split this into 3 separate loops?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;/P&gt;The compiler may correct the nesting of your loops automatically. Your last question, if I understand it, is about data locality. Normally, you would expect that keeping the assignments together improves the efficiency of access to pp() and avoids re-reading ha().&lt;BR /&gt;</description>
    <pubDate>Wed, 16 Feb 2011 17:41:50 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2011-02-16T17:41:50Z</dc:date>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising vectors</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753011#M8857</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I'm using the Intel Cluster Toolkit on our cluster, and so ifort (mpif90). I'm trying to optimize our program: I have tested gprof, oprofile and vtune (compiled with "-g -o2" or "-O3 -xT -Qdyncom"dummyblock" "). All 3 tools show me the same result: one of our subroutines takes 20% of the total time... This is a problem, because this subroutine is not so important in the code.&lt;BR /&gt;&lt;BR /&gt;When I look at the sources with vtune, I see a surprising thing:&lt;BR /&gt;do inp=icst,icen&lt;BR /&gt; su(inp)=d0&lt;BR /&gt; pp(inp)=d0&lt;BR /&gt;enddo&lt;BR /&gt;takes a lot of time (but icst and icen are not sooo big)!!! All the other "do loops" for initialization take a lot of time too...&lt;BR /&gt;&lt;BR /&gt;So I decided to split this into 2 "do loops":&lt;BR /&gt;do inp=icst,icen&lt;BR /&gt; su(inp)=d0&lt;BR /&gt;enddo&lt;BR /&gt;do inp=icst,icen&lt;BR /&gt; pp(inp)=d0&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;I reran vtune and saw that these 2 "do loops" don't take time anymore... But a new routine appears in the results, intel_new_memset, and it takes a lot of time.&lt;BR /&gt;&lt;BR /&gt;How should I interpret these results? Could someone help me understand why these "do loops" are a hotspot in this subroutine?&lt;BR /&gt;&lt;BR /&gt;Thx a lot,&lt;BR /&gt;Best regards,&lt;BR /&gt;Guillaume</description>
      <pubDate>Mon, 14 Feb 2011 12:14:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753011#M8857</guid>
      <dc:creator>Guillaume_De_Nayer</dc:creator>
      <dc:date>2011-02-14T12:14:28Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753012#M8858</link>
      <description>You have expressed surprise at your timing results but the information that you have given is insufficient to make me share your surprise. In particular, you have not stated:&lt;BR /&gt;&lt;BR /&gt;  1) the types of the variables, &lt;BR /&gt; 2) the number of times that the subroutines are called,&lt;BR /&gt; 3) the number of repetitions in the DO loops,&lt;BR /&gt; 4) how the statements in question affect cache coherency, and&lt;BR /&gt; 5) the number of times that "intel_new_memset" is called.&lt;BR /&gt;&lt;BR /&gt;It is hardly surprising that the optimizer can substitute &lt;BR /&gt;&lt;BR /&gt; CALL MEMSET(su,d0,(icen+1-icst)*sizeof(su(1)))&lt;BR /&gt;&lt;BR /&gt;for&lt;BR /&gt;&lt;BR /&gt; do inp=icst,icen&lt;BR /&gt;   su(inp)=d0&lt;BR /&gt; enddo&lt;BR /&gt;&lt;BR /&gt;However, when there is more than one assignment in the DO loop, similar calls can be made only if the optimizer can ascertain that the arrays (whose elements are being set) are distinct, have no overlap, etc. The language standards may impose some restrictions on how DO loops and other constructs are compiled into machine code, particularly if the arrays are dummy arguments.&lt;BR /&gt;&lt;BR /&gt;Statements such as "a lot of time" and "not important" are subjective, and can make sense only when the question "compared to what?" can be answered in each instance. It makes a difference if the surprise is caused by (i) mere novelty or (ii) failure to meet well-reasoned expectations.&lt;BR /&gt;</description>
      <pubDate>Mon, 14 Feb 2011 13:43:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753012#M8858</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2011-02-14T13:43:22Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753013#M8859</link>
      <description>Hi! Thx for your answer.&lt;BR /&gt;&lt;BR /&gt;1)&lt;BR /&gt;integer inp,icst,icen&lt;BR /&gt;real*8 su(nxyza),pp(nxyza)&lt;BR /&gt;&lt;BR /&gt;2)&lt;BR /&gt;Our problematic subroutine is called 8 times per time step, and for our test with Vtune I ran the program for 100 time steps. So our problematic subroutine was called 800 times (confirmed by gprof).&lt;BR /&gt;&lt;BR /&gt;3)&lt;BR /&gt;The number of repetitions is different for each thread. But here for this test:&lt;BR /&gt;icst=1&lt;BR /&gt;icen &amp;lt; 10000&lt;BR /&gt;&lt;BR /&gt;4)&lt;BR /&gt;I don't understand your question, sorry. Do you mean the cache of the processor?&lt;BR /&gt;Our processors are:&lt;BR /&gt;- intel Xeon E5620: &lt;A href="http://ark.intel.com/Product.aspx?id=47925" target="_blank"&gt;http://ark.intel.com/Product.aspx?id=47925&lt;/A&gt;&lt;BR /&gt;- intel Xeon X5650: &lt;A href="http://ark.intel.com/Product.aspx?id=47922" target="_blank"&gt;http://ark.intel.com/Product.aspx?id=47922&lt;/A&gt;&lt;BR /&gt;So 12 MB of cache.&lt;BR /&gt;&lt;BR /&gt;5)&lt;BR /&gt;Where can I see this? In Vtune?&lt;BR /&gt;With gprof I get:&lt;BR /&gt;Flat profile:&lt;BR /&gt;Each sample counts as 0.01 seconds.&lt;BR /&gt; % cumulative self self total&lt;BR /&gt;time seconds seconds calls s/call s/call name&lt;BR /&gt;15.19 2.65 2.65 __intel_new_memset&lt;BR /&gt;12.39 4.81 2.16 40000 0.00 0.00 resforward3d_&lt;BR /&gt; 9.81 6.52 1.71 800 0.00 0.01 calcp_exp_&lt;BR /&gt;&lt;BR /&gt;So I don't see how many times intel_new_memset was called.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;So if I understand correctly:&lt;BR /&gt;with &lt;BR /&gt;do inp=icst,icen&lt;BR /&gt; su(inp)=d0&lt;BR /&gt; pp(inp)=d0&lt;BR /&gt;enddo&lt;BR /&gt;the optimizer doesn't replace the loop with "CALL MEMSET", but with&lt;BR /&gt;do inp=icst,icen&lt;BR /&gt; su(inp)=d0&lt;BR /&gt;enddo&lt;BR /&gt;do inp=icst,icen&lt;BR /&gt; pp(inp)=d0&lt;BR /&gt;enddo&lt;BR /&gt;both loops are replaced with this call?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;I'm surprised by these results because the program is a fluid solver. What takes time in a fluid solver is solving the fluid problem... and this subroutine doesn't solve the fluid problem... There are more "do loops" in the subroutine where the fluid system is solved than in this problematic subroutine...&lt;BR /&gt;&lt;BR /&gt;Thx for your help.&lt;BR /&gt;Guillaume&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 14 Feb 2011 14:27:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753013#M8859</guid>
      <dc:creator>Guillaume_De_Nayer</dc:creator>
      <dc:date>2011-02-14T14:27:29Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753014#M8860</link>
      <description>The compiler automatically substitutes memset when you zero a single array. If the loop is long enough to be a major hot spot, this is a good substitution, particularly as it enables the library to decide at run time whether to use nontemporal stores (which write the zeros to memory without going through the cache). Nontemporal is likely to come out ahead for arrays which span a large fraction of cache.&lt;BR /&gt;You should check whether the additional time spent by memset is less than the time saved in the caller function.&lt;BR /&gt;You could encourage the compiler to use nontemporal on one or both of the arrays by setting !dir$ vector nontemporal. Due to the probable impossibility of determining alignment at compile time, the loop has to be split into 2 memset calls to use nontemporal for both. &lt;BR /&gt;The old -xT option, in case you didn't notice it, is meant specifically for early Core 2 or Woodcrest CPUs; it's not so good for current CPUs, even if you change to current spelling.&lt;BR /&gt;Of course, you should check whether you need to zero the array so often, or whether you can postpone that until you are using it for other purposes.</description>
      <pubDate>Mon, 14 Feb 2011 15:15:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753014#M8860</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-02-14T15:15:09Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753015#M8861</link>
      <description>Guillaume,&lt;BR /&gt;&lt;BR /&gt;If the zeroing of these two arrays consumes 20% of the run time, then either&lt;BR /&gt;&lt;BR /&gt; a) the remainder of your program (the other 80%) does very little work&lt;BR /&gt;or&lt;BR /&gt; b) you may be unnecessarily zeroing out the arrays.&lt;BR /&gt;&lt;BR /&gt;Note, in an MPI cluster, the controlling program alone may be performing the zeroing (initializing) of the arrays. If (icen-icst) is quite large, then consider using more than one thread to zero the arrays. If OpenMP can be blended into your app, try&lt;BR /&gt;&lt;BR /&gt;!$omp parallel sections&lt;BR /&gt;do inp=icst,icen&lt;BR /&gt; su(inp)=d0&lt;BR /&gt;enddo&lt;BR /&gt;!$omp section&lt;BR /&gt;do inp=icst,icen&lt;BR /&gt; pp(inp)=d0&lt;BR /&gt;enddo&lt;BR /&gt;!$omp end parallel sections&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 14 Feb 2011 15:39:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753015#M8861</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-02-14T15:39:26Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753016#M8862</link>
      <description>If you follow Jim's suggestion, and this doesn't produce a memset substitution, you should try the !dir$ vector nontemporal. If using just 2 threads, you may need to take action to distribute the threads between CPUs.</description>
      <pubDate>Mon, 14 Feb 2011 16:50:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753016#M8862</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-02-14T16:50:27Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753017#M8863</link>
      <description>you can also experiment with disabling the call to memset with option&lt;BR /&gt;&lt;BR /&gt;-nolib-inline&lt;BR /&gt;&lt;BR /&gt;and see if this helps. I have seen some code where the memset is not as fast as I'd like.&lt;BR /&gt;&lt;BR /&gt;ron</description>
      <pubDate>Mon, 14 Feb 2011 18:37:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753017#M8863</guid>
      <dc:creator>Ron_Green</dc:creator>
      <dc:date>2011-02-14T18:37:44Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753018#M8864</link>
      <description>Out of curiosity, should there be any difference as far as the compiler is concerned between&lt;BR /&gt;&lt;BR /&gt; do inp=icst,icen&lt;BR /&gt;    su(inp) = 0.d0&lt;BR /&gt; end do&lt;BR /&gt;&lt;BR /&gt;and&lt;BR /&gt;&lt;BR /&gt; su(icst:icen) = 0.d0&lt;BR /&gt;&lt;BR /&gt;in terms of optimization or performance?&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Andrew McLeod</description>
      <pubDate>Tue, 15 Feb 2011 09:24:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753018#M8864</guid>
      <dc:creator>andrewmcleod</dc:creator>
      <dc:date>2011-02-15T09:24:00Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753019#M8865</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;thx for your suggestions:&lt;BR /&gt;&lt;BR /&gt;- We don't want to use OpenMP (at the moment).&lt;BR /&gt;- I'm checking for each array whether initializing is necessary.&lt;BR /&gt;&lt;BR /&gt;- The "-xT" option is now removed, but I don't see any big differences. Is there an option which automatically optimizes the code for the processor on which the program is compiled? (Our code is compiled and run on a lot of different platforms... and these platforms are periodically updated.)&lt;BR /&gt;&lt;BR /&gt;- I have tested with the "nontemporal" directive, but I don't see any improvements or differences. Could you share links about this directive and the related ones? What I found on the net was not very clear (for me... :) ). So perhaps I did something wrong. &lt;BR /&gt;&lt;BR /&gt;- I have tested with "-nolib-inline". The sum of the cpu time of intel_new_memset + "problematic subroutine" (without -nolib-inline) is almost equal to the cpu time of the "problematic subroutine" with "-nolib-inline". So with or without memset I have the "same" results.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;In the NEW attached figure, we can see 8 "DO loops" with the same number of iterations "nxyza". Why do some of these "DO loops" take 10 times longer??? (the ones for the vectors bw and bt). Do the previous values in the vector have any influence? I mean, if the vector is already set to zero, is it faster than if it is full of 10^10??? Or is it a problem with the placement of the vectors in memory?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;I have tested with "su(icst:icen) = 0.d0", but I don't notice any difference.&lt;BR /&gt;&lt;BR /&gt;Thx a lot for all your suggestions and your help!&lt;BR /&gt;Guillaume</description>
      <pubDate>Tue, 15 Feb 2011 10:47:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753019#M8865</guid>
      <dc:creator>Guillaume_De_Nayer</dc:creator>
      <dc:date>2011-02-15T10:47:46Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753020#M8866</link>
      <description>Guillaume,&lt;BR /&gt;&lt;BR /&gt;Looking at your .jpeg screenshot, the loops for bw and bt are consuming 10x the run time counts of the other loops. All the other loops use the same extent for iterations. Therefore I suspect that the bw and bt arrays may have an alignment issue. These arrays should at least be on an 8-byte boundary but may work more efficiently on a 16-byte boundary.&lt;BR /&gt;&lt;BR /&gt;Run a quick check to see what the alignment of bw(1), bt(1), and the other arrays is. Try to get the address in hexadecimal. If the address does not end with 0 or 8, then for some reason these arrays are not "naturally aligned" (not on a boundary equal to the size of the element, in this case double==8).&lt;BR /&gt;&lt;BR /&gt;The other thing to consider:&lt;BR /&gt;&lt;BR /&gt; Are bw and bt arrays of doubles?&lt;BR /&gt;&lt;BR /&gt;If these arrays are NOT doubles (REAL(8)) then initializing with a double (REAL(8)) will cause the loop to contain a conversion operation. This too could account for the excess run time in those loops.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Tue, 15 Feb 2011 13:05:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753020#M8866</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-02-15T13:05:05Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753021#M8867</link>
      <description>Hi Jim,&lt;BR /&gt;&lt;BR /&gt;How can I get the address of bw(1), bt(1) in hexadecimal in FORTRAN? Could you tell me, please?&lt;BR /&gt;&lt;BR /&gt;All the arrays are doubles (REAL(8)).&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Guillaume</description>
      <pubDate>Tue, 15 Feb 2011 14:16:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753021#M8867</guid>
      <dc:creator>Guillaume_De_Nayer</dc:creator>
      <dc:date>2011-02-15T14:16:08Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753022#M8868</link>
      <description>The legacy function loc() gives you the address as an integer. The standard c_loc function from iso_c_binding is similar, but it returns a type(c_ptr), which would first have to be converted to an integer (for example with transfer()) before it could be passed to iand. You want to know whether the low order bits are zero.&lt;BR /&gt;IF(iand(loc(bw(1)),7) /= 0)write(*,*) "bw not aligned for real(8)"&lt;BR /&gt;As Jim pointed out, such a misalignment could account for poor performance, except that the memset substitution would apply a run-time correction, to the extent that is possible.&lt;BR /&gt;You could go further; if you can ensure that iand(loc(array),15) == 0 (16-byte alignment, as needed for SSE) for all arrays, you can put them all in one loop, and set &lt;BR /&gt;!dir$ vector nontemporal &lt;BR /&gt;!dir$ vector aligned &lt;BR /&gt;do ....&lt;BR /&gt;for that loop. In such a case, you could expect to gain performance by zeroing up to 8 arrays in one loop.</description>
      <pubDate>Tue, 15 Feb 2011 16:33:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753022#M8868</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-02-15T16:33:15Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753023#M8869</link>
      <description>Guillaume,&lt;BR /&gt;&lt;BR /&gt;If you are using the debugger, it should have an option to toggle between decimal and hexadecimal. How you do this depends on the debugger. In Visual Studio, hover the mouse over the variable and it shows the value in the current radix; if that is not the radix you want, right click (while hovering) and choose Hexadecimal.&lt;BR /&gt;&lt;BR /&gt;An alternate way is to perform, within the debugger, a memory dump at the variable, then look at the leftmost column, which will show you the address in hexadecimal; the remaining columns will show the contents.&lt;BR /&gt;&lt;BR /&gt;If you want to do this from within FORTRAN use the Zw[.m] edit descriptor. Something like&lt;BR /&gt;&lt;BR /&gt; write(*,'(Z20)') loc(array(1))&lt;BR /&gt;&lt;BR /&gt;Alternatively, remembering that we are looking for the data alignment as opposed to the actual address, use&lt;BR /&gt;&lt;BR /&gt; write(*,*) mod(loc(array(1)),16)&lt;BR /&gt;&lt;BR /&gt;A return value of 0 or 8 is good, anything else and you (somehow) have an alignment issue.&lt;BR /&gt;Note, when 0, the array access using the SSE vector instructions (compiler does this for you) may be a tad faster. Don't worry about the aligned to 16 issue until later.&lt;BR /&gt;&lt;BR /&gt;By the way, were the two arrays in question arrays of doubles (real(8))?&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
      <pubDate>Wed, 16 Feb 2011 14:05:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753023#M8869</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-02-16T14:05:42Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753024#M8870</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;pfiouuuuu too many new FORTRAN functions in 2 days ?!? :D&lt;BR /&gt;&lt;BR /&gt;I have tried:&lt;BR /&gt;IF(iand(loc(bw(1)),3) /= 0)write(*,*) "bw not aligned"&lt;BR /&gt;and the same for all my arrays...NO PROBLEM.&lt;BR /&gt;&lt;BR /&gt;I have also tried:&lt;BR /&gt;write(*,*) mod(loc(array(1)),16)&lt;BR /&gt;And the results for all the arrays are 0.&lt;BR /&gt;&lt;BR /&gt;So I don't have any problem with alignment for these vectors.&lt;BR /&gt;I have tried to use the nontemporal and aligned directives... But it seems slower.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;I found that a lot of the initialising loops are not necessary in our code. So the program runs 10% faster now :D.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Generally, if I have a loop like this:&lt;BR /&gt;do k=1,nkm&lt;BR /&gt; do i=1,nim&lt;BR /&gt; do j=1,njm&lt;BR /&gt; inp=ha(i,j,k)&lt;BR /&gt; be(inp)=pp(inp)*(1.-fx(inp))+pp(inp+idew)*fx(inp)&lt;BR /&gt; bn(inp)=pp(inp)*(1.-fy(inp))+pp(inp+idns)*fy(inp)&lt;BR /&gt; bt(inp)=pp(inp)*(1.-fz(inp))+pp(inp+idtb)*fz(inp)&lt;BR /&gt; enddo&lt;BR /&gt; enddo&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;Is it better to split this into 3 separate loops?&lt;BR /&gt;&lt;BR /&gt;Thx for all your suggestions!!!&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 16 Feb 2011 15:30:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753024#M8870</guid>
      <dc:creator>Guillaume_De_Nayer</dc:creator>
      <dc:date>2011-02-16T15:30:42Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753025#M8871</link>
      <description>do k=1,nkm&lt;BR /&gt; do i=1,nim&lt;BR /&gt; do j=1,njm&lt;BR /&gt;*** reorder the indexes on ha to run from inner do loop to outer do loop order&lt;BR /&gt;*** IOW ha(j,i,k)&lt;BR /&gt;*** or reorder the preceding loops to order k,j,i&lt;BR /&gt; inp=ha(i,j,k)&lt;BR /&gt;*** see notes below&lt;BR /&gt; be(inp)=pp(inp)*(1.-fx(inp))+pp(inp+idew)*fx(inp)&lt;BR /&gt; bn(inp)=pp(inp)*(1.-fy(inp))+pp(inp+idns)*fy(inp)&lt;BR /&gt; bt(inp)=pp(inp)*(1.-fz(inp))+pp(inp+idtb)*fz(inp)&lt;BR /&gt; enddo&lt;BR /&gt; enddo&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;If (nkm * nim * njm) is relatively large .AND. the above loop consumes a considerable portion of run time, consider the following&lt;BR /&gt;&lt;BR /&gt;logical(1) :: haIndex(nkm * nim * njm)&lt;BR /&gt;...&lt;BR /&gt;haIndex =.false. ! set index found to .false.&lt;BR /&gt;do k=1,nkm&lt;BR /&gt; do j=1,njm&lt;BR /&gt; do i=1,nim&lt;BR /&gt; haIndex(ha(i,j,k)) = .true.&lt;BR /&gt; end do&lt;BR /&gt; end do&lt;BR /&gt;end do&lt;BR /&gt;do inp = 1,nkm * nim * njm&lt;BR /&gt; if(haIndex(inp)) then&lt;BR /&gt;be(inp)=pp(inp)*(1.-fx(inp))+pp(inp+idew)*fx(inp)&lt;BR /&gt;bn(inp)=pp(inp)*(1.-fy(inp))+pp(inp+idns)*fy(inp)&lt;BR /&gt;bt(inp)=pp(inp)*(1.-fz(inp))+pp(inp+idtb)*fz(inp)&lt;BR /&gt; endif&lt;BR /&gt;end do&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Wed, 16 Feb 2011 16:01:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753025#M8871</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-02-16T16:01:21Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753026#M8872</link>
      <description>&lt;DIV id="tiny_quote"&gt;
                &lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=491837" class="basic" href="https://community.intel.com/en-us/profile/491837/"&gt;Guillaume De Nayer&lt;/A&gt;&lt;/DIV&gt;
                &lt;DIV style="background-color: #e5e5e5; padding: 5px; border: 1px inset; margin-left: 2px; margin-right: 2px;"&gt;&lt;I&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Generally, if I have a loop like this:&lt;BR /&gt;do k=1,nkm&lt;BR /&gt; do i=1,nim&lt;BR /&gt; do j=1,njm&lt;BR /&gt; inp=ha(i,j,k)&lt;BR /&gt; be(inp)=pp(inp)*(1.-fx(inp))+pp(inp+idew)*fx(inp)&lt;BR /&gt; bn(inp)=pp(inp)*(1.-fy(inp))+pp(inp+idns)*fy(inp)&lt;BR /&gt; bt(inp)=pp(inp)*(1.-fz(inp))+pp(inp+idtb)*fz(inp)&lt;BR /&gt; enddo&lt;BR /&gt; enddo&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;Is it better to split this into 3 separate loops?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;/P&gt;The compiler may correct the nesting of your loops automatically. Your last question, if I understand it, is about data locality. Normally, you would expect that keeping the assignments together improves the efficiency of access to pp() and avoids re-reading ha().&lt;BR /&gt;</description>
      <pubDate>Wed, 16 Feb 2011 17:41:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753026#M8872</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-02-16T17:41:50Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753027#M8873</link>
      <description>I forgot to add something to my prior post about using a table of logicals (marking which elements get calculated).&lt;BR /&gt;The reason for using the haIndex array is that, when processing the computation loop, the be(inp)= ... assignments are sequential (with skips for unmarked entries). This is friendlier to cache utilization.&lt;BR /&gt;&lt;BR /&gt;When the runs of .true.'s and/or .false.'s are relatively long, then consider using an array of integers in place of the logical array, containing:&lt;BR /&gt;&lt;BR /&gt; +n == n .true.'s&lt;BR /&gt; -n == n .false.'s&lt;BR /&gt; 0 == end of table (or you can use a count)&lt;BR /&gt;&lt;BR /&gt;With this table computed, the do loops that compute the values always cover a sequential range; in other words, there is no if test inside the do loop. Eliminating the if test will improve the vectorizability of the code.&lt;BR /&gt;&lt;BR /&gt; iFill = 1&lt;BR /&gt; iPick = 1&lt;BR /&gt; do while(counts(iPick) .ne. 0)&lt;BR /&gt; if(counts(iPick) .gt. 0) then&lt;BR /&gt; do inp=iFill,iFill+counts(iPick)-1&lt;BR /&gt; be(inp)=pp(inp)*(1.-fx(inp))+pp(inp+idew)*fx(inp)&lt;BR /&gt; bn(inp)=pp(inp)*(1.-fy(inp))+pp(inp+idns)*fy(inp)&lt;BR /&gt; bt(inp)=pp(inp)*(1.-fz(inp))+pp(inp+idtb)*fz(inp)&lt;BR /&gt; end do&lt;BR /&gt; iFill = iFill + counts(iPick)&lt;BR /&gt; else&lt;BR /&gt; iFill = iFill - counts(iPick)&lt;BR /&gt; endif&lt;BR /&gt; iPick = iPick+1&lt;BR /&gt; end do&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Wed, 16 Feb 2011 23:34:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753027#M8873</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-02-16T23:34:51Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753028#M8874</link>
      <description>@jimdempseyatthecove&lt;BR /&gt;ha(j,i,k) is not an array, but a function:&lt;BR /&gt;ha(ii,jj,kk)=lk(kk)+li(ii)+jj&lt;BR /&gt;&lt;BR /&gt;So it is not a problem to use:&lt;BR /&gt;do k=1,nkm&lt;BR /&gt; do i=1,nim&lt;BR /&gt; do j=1,njm&lt;BR /&gt; inp=ha(i,j,k)&lt;BR /&gt; enddo&lt;BR /&gt; enddo&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;It is a little strange... I know, but the previous developers chose this way.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;@all&lt;BR /&gt;I'm experimenting with your suggestions for the loops.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 17 Feb 2011 08:16:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753028#M8874</guid>
      <dc:creator>Guillaume_De_Nayer</dc:creator>
      <dc:date>2011-02-17T08:16:02Z</dc:date>
    </item>
    <item>
      <title>Hotspot problem (found with vtune): do loop for initialising ve</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753029#M8875</link>
      <description>&amp;gt;&amp;gt;ha(j,i,k) is not an array. But a function:&lt;BR /&gt;ha(ii,jj,kk)=lk(kk)+li(ii)+jj&lt;BR /&gt;&lt;BR /&gt;Then, assuming lk() and li() are arrays, you would want the j index of the caller's loop to be the innermost loop (as it was written). It would not hurt to add a comment explaining this at the point in the code where the function call is made. This would serve as a notice to the next person supporting this code, so that they will not make the same generalization mistake that I did.&lt;BR /&gt;&lt;BR /&gt;Building the table of LOGICALS indicating referenced/not referenced, then using that table to discover runs of adjacent .true.'s and adjacent .false.'s, might be sufficient (i.e. the table of counts can be omitted)&lt;BR /&gt;&lt;BR /&gt;haPicked = .false.&lt;BR /&gt;do k=1,nkm&lt;BR /&gt; do i=1,nim&lt;BR /&gt; do j=1,njm&lt;BR /&gt; haPicked(ha(i,j,k)) = .true.&lt;BR /&gt; enddo&lt;BR /&gt; enddo&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;last = .false.&lt;BR /&gt;count = 0&lt;BR /&gt;inp = 1&lt;BR /&gt;do i=1,nkm*nim*njm&lt;BR /&gt; if(last) then&lt;BR /&gt; ! last was .true.&lt;BR /&gt; if(haPicked(i)) then&lt;BR /&gt; ! increment count of .true.'s in a row seen&lt;BR /&gt; count = count+1&lt;BR /&gt; else&lt;BR /&gt; ! end of .true.'s in a row&lt;BR /&gt; ! now compute results for those entries&lt;BR /&gt; do j=1,count&lt;BR /&gt; bt(inp+j-1) = ...&lt;BR /&gt; ...&lt;BR /&gt; ...&lt;BR /&gt; end do&lt;BR /&gt; inp = inp + count&lt;BR /&gt; ! the current .false. entry starts a new run&lt;BR /&gt; count = 1&lt;BR /&gt; last = .false.&lt;BR /&gt; endif&lt;BR /&gt; else&lt;BR /&gt; ! last was .false.&lt;BR /&gt; if(haPicked(i)) then&lt;BR /&gt; ! switching from .false.'s to .true.('s)&lt;BR /&gt; ! advance over .false.'s&lt;BR /&gt; inp = inp + count&lt;BR /&gt; last = .true.&lt;BR /&gt; count = 1&lt;BR /&gt; else&lt;BR /&gt; count = count + 1&lt;BR /&gt; endif&lt;BR /&gt; endif&lt;BR /&gt;end do&lt;BR /&gt;if(last) then&lt;BR /&gt; do j=1,count&lt;BR /&gt; bt(inp+j-1) = ...&lt;BR /&gt; ...&lt;BR /&gt; ...&lt;BR /&gt; end do&lt;BR /&gt;endif&lt;BR /&gt;&lt;BR /&gt;Something like the above would work.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Thu, 17 Feb 2011 21:44:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Hotspot-problem-found-with-vtune-do-loop-for-initialising/m-p/753029#M8875</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-02-17T21:44:53Z</dc:date>
    </item>
  </channel>
</rss>

