<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Large array in two socket system in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1548302#M169505</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I wonder if there is a more efficient way of doing the following than just leaving ifort to decide.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In a two-socket system (2x E5-2690 v4 and 256GB over 8 slots) one has a huge array, say Double Precision A(1000,1000,4000), indexed (J,K,I). What would be the most efficient way of doing&lt;/P&gt;&lt;P&gt;Do I = 1, 4000&lt;/P&gt;&lt;P&gt;&amp;nbsp;Do K = 1, 1000&lt;/P&gt;&lt;P&gt;&amp;nbsp;Do J = 1, 1000&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; A(J,K,I) = calculations calling A(J,K,I)&lt;/P&gt;&lt;P&gt;End Do&lt;/P&gt;&lt;P&gt;End Do&lt;/P&gt;&lt;P&gt;End Do&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Using OMP, how does one minimize QPI transfers?&lt;/P&gt;&lt;P&gt;Thanks for any suggestions.&lt;/P&gt;</description>
    <pubDate>Tue, 28 Nov 2023 13:19:51 GMT</pubDate>
    <dc:creator>a_b_1</dc:creator>
    <dc:date>2023-11-28T13:19:51Z</dc:date>
    <item>
      <title>Large array in two socket system</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1548302#M169505</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I wonder if there is a more efficient way of doing the following than just leaving ifort to decide.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In a two-socket system (2x E5-2690 v4 and 256GB over 8 slots) one has a huge array, say Double Precision A(1000,1000,4000), indexed (J,K,I). What would be the most efficient way of doing&lt;/P&gt;&lt;P&gt;Do I = 1, 4000&lt;/P&gt;&lt;P&gt;&amp;nbsp;Do K = 1, 1000&lt;/P&gt;&lt;P&gt;&amp;nbsp;Do J = 1, 1000&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; A(J,K,I) = calculations calling A(J,K,I)&lt;/P&gt;&lt;P&gt;End Do&lt;/P&gt;&lt;P&gt;End Do&lt;/P&gt;&lt;P&gt;End Do&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Using OMP, how does one minimize QPI transfers?&lt;/P&gt;&lt;P&gt;Thanks for any suggestions.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Nov 2023 13:19:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1548302#M169505</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2023-11-28T13:19:51Z</dc:date>
    </item>
    <item>
      <title>Re: Large array in two socket system</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1548392#M169519</link>
      <description>&lt;P&gt;Consider using the KMP_AFFINITY environment variable to pin the OpenMP threads, and with them the sections of the array they work on, to a specific logical processor (or logical processors).&lt;/P&gt;&lt;P&gt;Depending on your calculations (not just this loop but throughout the application) you may or may not wish to use HT siblings.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For the above loop (but not necessarily the complete application)&lt;/P&gt;&lt;P&gt;one thread per core&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; KMP_AFFINITY=granularity=core,compact&lt;/P&gt;&lt;P&gt;two threads per core&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; KMP_AFFINITY=granularity=thread,compact&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then use static scheduling on the outer loop only&lt;/P&gt;&lt;LI-CODE lang="fortran"&gt;!$omp parallel do schedule(static) private(i,j,k)
do i=1,4000
...
end do
!$omp end parallel do&lt;/LI-CODE&gt;&lt;P&gt;Note: you may want to experiment with replacing "compact" with "scatter"; however, I think compact would be better.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, if you know other services and/or applications are running on the system, you may want (or need) to restrict your application to fewer than the full set of system threads: &lt;A href="https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2023-2/thread-affinity-interface.html" target="_self"&gt;KMP_AFFINITY&lt;/A&gt;=proclist={&amp;lt;proc-list&amp;gt;}&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 28 Nov 2023 16:47:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1548392#M169519</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2023-11-28T16:47:05Z</dc:date>
    </item>
    <item>
      <title>Re: Large array in two socket system</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549014#M169605</link>
      <description>&lt;P&gt;Thank you.&lt;/P&gt;&lt;P&gt;The use of affinity in this case hinges on knowing how the data is spread across the shared memory, so that each thread can be arranged to execute nearest to its data. I am not clear how this can be achieved.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2023 02:40:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549014#M169605</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2023-11-30T02:40:58Z</dc:date>
    </item>
    <item>
      <title>Re: Large array in two socket system</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549081#M169610</link>
      <description>&lt;P&gt;A way to do that - it is a trick I learned about a few years ago, but never had a chance to actually use properly - is to initialise the arrays in question in an explicit OpenMP loop. Not via an array operation, but a plain classical loop.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2023 08:28:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549081#M169610</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-11-30T08:28:04Z</dc:date>
    </item>
    <item>
      <title>Re: Large array in two socket system</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549172#M169614</link>
      <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/75329"&gt;@Arjen_Markus&lt;/a&gt;&amp;nbsp;Thanks for pointing this out, I overlooked the NUMA first touch.&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/68289"&gt;@a_b_1&lt;/a&gt;&amp;nbsp;When you perform a "first touch", have the OpenMP loop team be the same as the team processing the data later on. Use static scheduling.&lt;/P&gt;&lt;P&gt;"first touch":&lt;/P&gt;&lt;P&gt;When a heap allocation is made using virtual addresses that have never before been used by the process, the mapping to physical RAM is not made until the first touch of a memory location within each page of the virtual memory. On a single-socket system this doesn't mean much (other than that a page in the page file is allocated only if the page gets paged out). On a multi-socket system .AND. where the BIOS is configured for non-interleaved memory access, each socket has a dedicated selection of RAM sticks. This presents a NUMA configuration. In this case, you would want the hardware thread that first touches the location (after allocation) to be the same thread that processes the data later on. Note, if your system BIOS has configured the memory for interleaved operation (quasi-UMA), then there will be no first touch benefit.&lt;/P&gt;&lt;P&gt;Note 2: I've observed in the past that many of the motherboard and BIOS developers are not native English speakers and often invert the meaning of interleaved. This means the BIOS selection for memory access may be the opposite of what you read.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2023 14:56:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549172#M169614</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2023-11-30T14:56:22Z</dc:date>
    </item>
    <item>
      <title>Re: Large array in two socket system</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549191#M169615</link>
      <description>&lt;P&gt;Thank you both very much.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This makes loads of sense. I can get on with experimenting. I use 4-GPU acceleration as well, and knowing where the data is will definitely be useful.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I use an HP server, so checking for the interleaved memory setting should be ok.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2023 15:44:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549191#M169615</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2023-11-30T15:44:29Z</dc:date>
    </item>
    <item>
      <title>Re: Large array in two socket system</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549269#M169624</link>
      <description>&lt;P&gt;For NUMA access, you do not want interleaved memory.&lt;/P&gt;&lt;P&gt;Interleaved memory configures the memory such that each successive cache-line-wide load comes from the memory attached to each successive CPU socket. This gives you balanced access, which is not necessarily the best configuration for HPC environments. NUMA access (non-interleaved) is better when your applications are coded to be affinity aware .and. process the data in an affinity managed manner.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Your server administrator (if it is not you) may have their own preferred configuration that differs from yours.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2023 18:06:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549269#M169624</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2023-11-30T18:06:15Z</dc:date>
    </item>
    <item>
      <title>Re: Large array in two socket system</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549572#M169659</link>
      <description>&lt;P&gt;Can one assign sections to specific cores?&lt;/P&gt;</description>
      <pubDate>Fri, 01 Dec 2023 12:08:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549572#M169659</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2023-12-01T12:08:52Z</dc:date>
    </item>
    <item>
      <title>Re: Large array in two socket system</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549596#M169660</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&lt;SPAN&gt;Can one assign sections to specific cores?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Not directly. The sections of the array have the granularity of the virtual memory page size. Whichever socket's hardware thread touches the memory of a page first gets that section of the array (in that socket's memory). Note, this is first touch since process start, not after a subsequent deallocation followed by allocation.&lt;/P&gt;&lt;P&gt;Your only controls are:&lt;/P&gt;&lt;P&gt;1) Ensure that your software threads are pinned to hardware threads&lt;/P&gt;&lt;P&gt;2) Align sensitive arrays on page size boundaries. This is generally 4KB, but can differ. There is a system call to obtain the page size.&lt;/P&gt;&lt;P&gt;3) Control your loops such that the same threads process the same sections of data. Note, a multi-dimensional array is seldom partitioned exactly at page boundaries. IOW the start and end point of a section might be in a page first touched by a neighboring thread. The cost of trying to work around this (e.g. padding dimensions) can outweigh the benefit (with an exception for padding for SIMD alignment and multiples of the left most array index).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I suggest that you identify the array(s) and section(s) of code that are heavily compute bound. Then produce a test program using that/those array sizes in a timed loop. Set the run time to at least 2 minutes and at least a few tens of iterations of the timed section. 
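&lt;/P&gt;&lt;P&gt;A minimal sketch of such a test program might look like the following. The program name, the reduced array sizes, the iteration count and the placeholder calculation are all assumptions made for illustration, not taken from the original code. The array is first touched by the same OpenMP team, with the same static schedule, as the timed loop.&lt;/P&gt;&lt;LI-CODE lang="fortran"&gt;program first_touch_timing
  use omp_lib
  implicit none
  ! Assumed sizes, reduced from A(1000,1000,4000) so the sketch runs quickly
  integer, parameter :: nj = 200, nk = 200, ni = 400, niter = 20
  double precision, allocatable :: a(:,:,:)
  double precision :: t0, t, ttotal, tmin, tmax
  integer :: i, j, k, iter

  allocate(a(nj,nk,ni))

  ! First touch: plain loops (not an array assignment), same team and
  ! same static schedule as the compute loop below
  !$omp parallel do schedule(static) private(i,j,k)
  do i = 1, ni
    do k = 1, nk
      do j = 1, nj
        a(j,k,i) = 1.0d0
      end do
    end do
  end do
  !$omp end parallel do

  ttotal = 0.0d0
  tmin = huge(tmin)
  tmax = 0.0d0
  do iter = 1, niter
    t0 = omp_get_wtime()
    !$omp parallel do schedule(static) private(i,j,k)
    do i = 1, ni
      do k = 1, nk
        do j = 1, nj
          a(j,k,i) = 0.5d0*a(j,k,i) + 1.0d0   ! placeholder calculation
        end do
      end do
    end do
    !$omp end parallel do
    t = omp_get_wtime() - t0
    ttotal = ttotal + t
    tmin = min(tmin, t)
    tmax = max(tmax, t)
  end do

  print '(a,3es12.4)', 'total, fastest, slowest (s): ', ttotal, tmin, tmax
end program first_touch_timing&lt;/LI-CODE&gt;&lt;P&gt;Build with OpenMP enabled (e.g. -qopenmp) and increase the sizes and iteration count until the run time is long enough to be meaningful.&lt;/P&gt;&lt;P&gt;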
Gather statistics: total time, fastest iteration time, slowest iteration time.&lt;/P&gt;&lt;P&gt;With this setup, you can then vary test runs with permutations of:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;NUMA enabled&lt;/P&gt;&lt;P&gt;NUMA disabled&lt;/P&gt;&lt;P&gt;Arrays aligned to 4KB&lt;/P&gt;&lt;P&gt;Arrays not aligned&lt;/P&gt;&lt;P&gt;Threads affinity pinned&lt;/P&gt;&lt;P&gt;Threads not affinity pinned&lt;/P&gt;&lt;P&gt;Placement of threads (compact, scatter, ... or specific placements)&lt;/P&gt;&lt;P&gt;Using 1 thread per core&lt;/P&gt;&lt;P&gt;Using 2 threads per core&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Note, affinity pinning of software threads can improve cache hit probabilities. Also, some of the newer CPUs have different types of cores: Efficiency and Performance. So, this may factor into your testing.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now, if this is too much work, then do not worry about optimization.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 01 Dec 2023 13:57:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Large-array-in-two-socket-system/m-p/1549596#M169660</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2023-12-01T13:57:38Z</dc:date>
    </item>
  </channel>
</rss>

