<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Quote:Gregg S. (Intel) wrote: in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081576#M7101</link>
    <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Gregg S. (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;When initializing memory on the presumably 2-socket server, be sure to first touch it in parallel, the same way it will later be used. Otherwise, all the pages end up on 1 of the 2 sockets, and 12 of 24 cores have to do remote memory access.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I think this is helpful, but why are the behaviors on the PC and the cluster so different? Can you give more details?&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 04 Aug 2016 03:01:04 GMT</pubDate>
    <dc:creator>ZT_X_</dc:creator>
    <dc:date>2016-08-04T03:01:04Z</dc:date>
    <item>
      <title>Different performance on PC and cluster of Fortran-OpenMP code</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081572#M7097</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Hello! I have encountered a problem when programming Fortran-OpenMP code. I used a PARALLEL DO directive to parallelize a time-consuming part of my Fortran code. However, the code performs very differently on the PC (Intel(R) Core(TM) i7-3770 @ 3.4GHz) and on the cluster (Intel(R) Xeon(R) CPU E5-2620 @ 2.10GHz). The performance on the PC was satisfactory, about 74% parallel efficiency with 4 threads, but only 26% on the cluster.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I have been frustrated by this problem for a long time, and I think there may be something beyond my knowledge, so I am asking for help here. Thank you very much!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2016 06:39:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081572#M7097</guid>
      <dc:creator>ZT_X_</dc:creator>
      <dc:date>2016-08-03T06:39:09Z</dc:date>
    </item>
    <item>
      <title>How many iterations does the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081573#M7098</link>
      <description>&lt;P&gt;How many iterations does the DO perform?&lt;/P&gt;

&lt;P&gt;How many threads are available inside the parallel region?&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2016 16:37:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081573#M7098</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-08-03T16:37:15Z</dc:date>
    </item>
    <item>
      <title>When initializing memory on</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081574#M7099</link>
      <description>&lt;P&gt;When initializing memory on the presumably 2-socket server, be sure to first touch it in parallel, the same way it will later be used. Otherwise, all the pages end up on 1 of the 2 sockets, and 12 of 24 cores have to do remote memory access.&lt;/P&gt;
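&lt;P&gt;A minimal sketch of what first touch can look like, assuming an operating system that places each page on the NUMA node of the thread that first writes to it (the usual default on Linux). The array a, the size n, and the loop bodies are hypothetical placeholders, not taken from the original code.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program first_touch_sketch
    use omp_lib
    implicit none
    integer, parameter :: n = 4000000   ! hypothetical problem size
    real(8), allocatable :: a(:)
    real(8) :: total
    integer :: i

    ! allocate only reserves address space; pages are typically placed
    ! when a thread first writes to them
    allocate(a(n))

    ! first touch: initialize with the same static schedule as the compute loop
    !$omp parallel do schedule(static)
    do i = 1, n
        a(i) = 1.0d0
    end do
    !$omp end parallel do

    ! compute loop: same schedule, so each thread revisits the pages it touched
    total = 0.0d0
    !$omp parallel do schedule(static) reduction(+:total)
    do i = 1, n
        a(i) = 2.0d0 * a(i)
        total = total + a(i)
    end do
    !$omp end parallel do

    print *, 'threads =', omp_get_max_threads(), ' total =', total
end program first_touch_sketch
&lt;/PRE&gt;

&lt;P&gt;The placement only stays local if threads are not migrated between sockets, so this is usually combined with thread pinning, for example by setting OMP_PROC_BIND=true in the environment.&lt;/P&gt;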

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2016 22:44:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081574#M7099</guid>
      <dc:creator>Gregg_S_Intel</dc:creator>
      <dc:date>2016-08-03T22:44:15Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081575#M7100</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;How many iterations does the DO perform?&lt;/P&gt;

&lt;P&gt;How many threads are available inside the parallel region?&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 15.6096px;"&gt;The number of iterations is controlled by "n_vbns" and is in the many thousands.&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 15.6096px;"&gt;The number of available threads is controlled by "n_cpu"; it is at most 8 on the PC and at most 12 on the cluster.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Aug 2016 02:50:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081575#M7100</guid>
      <dc:creator>ZT_X_</dc:creator>
      <dc:date>2016-08-04T02:50:31Z</dc:date>
    </item>
    <item>
      <title>Quote:Gregg S. (Intel) wrote:</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081576#M7101</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Gregg S. (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;When initializing memory on the presumably 2-socket server, be sure to first touch it in parallel, the same way it will later be used. Otherwise, all the pages end up on 1 of the 2 sockets, and 12 of 24 cores have to do remote memory access.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I think this is helpful, but why are the behaviors on the PC and the cluster so different? Can you give more details?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Aug 2016 03:01:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081576#M7101</guid>
      <dc:creator>ZT_X_</dc:creator>
      <dc:date>2016-08-04T03:01:04Z</dc:date>
    </item>
    <item>
      <title>ZT,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081577#M7102</link>
      <description>&lt;P&gt;ZT,&lt;/P&gt;

&lt;P&gt;When you post code, please copy to clipboard, then in the forum, click on the button&lt;/P&gt;

&lt;P&gt;{...}&lt;BR /&gt;
	code&lt;/P&gt;

&lt;P&gt;on the tool bar.&amp;nbsp; This will open a dialog box with a pull-down control and an edit box.&lt;BR /&gt;
	Click the pull-down and select Fortran (or C++ if posting C++ code), then paste the contents of the clipboard into the edit box.&lt;/P&gt;

&lt;P&gt;Doing this is quicker than uploading a screenshot, .AND. will permit the readers to copy and paste your source in formulating a response. You can also paste the complete loop.&lt;/P&gt;

&lt;P&gt;The effect Gregg describes in reply #3 acts on your program behind the scenes. On a NUMA system, the performance gains can only be attained by carefully managing your allocations and deallocations. Preferably, each allocation is reused by the same thread .OR. the same sections of an allocation are first touched and then reused by the same thread.&lt;/P&gt;

&lt;P&gt;The code you have shown allocates (and presumably deallocates) rele_surf and weight n_va times _outside_ the parallel region on the main thread; then, upon entry to the parallel region, each of&amp;nbsp;the additional threads of the team allocates them an additional n_va times. The likelihood of each thread's allocations getting the same virtual addresses on each of the n_va * 6 (* number of threads) allocations is virtually nil.&lt;/P&gt;

&lt;P&gt;Consider making rele_surf and weight module allocatable arrays that are threadprivate and initially not allocated, then allocate each once to the max size required. Sketch:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;module threadprivate_data
    real, allocatable :: rele_surf(:), weight(:)
    !$omp threadprivate(rele_surf, weight)
end module threadprivate_data

program your_program
    use threadprivate_data
    implicit none
    ...
    ! once only code
    ! after you know the working sizes
    ! compute the largest allocation size
    max_max_grid_len3 = 0
    do i_va = 1, n_va
      max_max_grid_len3 = max(max_max_grid_len3, va(i_va)%max_grid_len3)
    end do
    ! status is private so that threads do not race on a shared variable
    !$omp parallel private(status)
        ! each thread allocates (only once) the working data arrays
        allocate(rele_surf(max_max_grid_len3), stat=status)
        if(status .ne. 0) stop
        allocate(weight(max_max_grid_len3), stat=status)
        if(status .ne. 0) stop
    !$omp end parallel
    ...
    ! your code follows
    ! *** remove the allocation/deallocation
    ! *** remove the private(rele_surf,weight)
    ! *** do not use size(rele_surf) or size(weight)
    ! *** use copy of va(i_va)%max_grid_len3 instead
&lt;/PRE&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 04 Aug 2016 13:46:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Different-perfermance-on-PC-and-cluster-of-Fortran-OpenMP-code/m-p/1081577#M7102</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-08-04T13:46:42Z</dc:date>
    </item>
  </channel>
</rss>

