<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Best hardware for OpenMP application in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911865#M83597</link>
    <description>&lt;P&gt;We are considering purchasing a computer to run very large parallelized applications based on OpenMP. We are looking for the optimal configuration in terms of processor (e.g., i7 vs. Xeon), cache memory, number of processors, etc.&lt;/P&gt;
&lt;P&gt;Any suggestion or experiences you can share would be highly appreciated.&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;R&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 15 Jan 2010 16:26:51 GMT</pubDate>
    <dc:creator>Reinaldo_Garcia</dc:creator>
    <dc:date>2010-01-15T16:26:51Z</dc:date>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911865#M83597</link>
      <description>&lt;P&gt;We are considering purchasing a computer to run very large parallelized applications based on OpenMP. We are looking for the optimal configuration in terms of processor (e.g., i7 vs. Xeon), cache memory, number of processors, etc.&lt;/P&gt;
&lt;P&gt;Any suggestion or experiences you can share would be highly appreciated.&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;R&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 15 Jan 2010 16:26:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911865#M83597</guid>
      <dc:creator>Reinaldo_Garcia</dc:creator>
      <dc:date>2010-01-15T16:26:51Z</dc:date>
    </item>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911866#M83598</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;For this purpose, the main distinction between Core i7 and Xeon is the number of CPU packages. Core i7 is excellent for supporting up to 4 threads (8, if your application uses HyperThreading effectively); many systems support 12GB RAM routinely, or even 24GB. Most Xeon 55xx platforms (dual quad-core) provide 8 cores and could support 16 threads with HyperThreading, with 48GB RAM not an unusual configuration.&lt;/P&gt;
&lt;P&gt;Within a few months, corresponding Westmere models with 12 cores will be available. At that time, for an application to be considered large, it might require the Nehalem-EX model with 4 CPU packages of 6 or 8 cores each, with 128 or 256GB RAM. For OpenMP to run well on the larger numbers of packages and cores, increased attention to memory locality issues is needed.&lt;/P&gt;</description>
      <pubDate>Fri, 15 Jan 2010 17:20:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911866#M83598</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2010-01-15T17:20:40Z</dc:date>
    </item>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911867#M83599</link>
      <description>&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A rel="/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=425568" class="basic" href="https://community.intel.com/en-us/profile/425568/"&gt;Reinaldo Garcia&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="border: 1px inset; padding: 5px; background-color: #e5e5e5; margin-left: 2px; margin-right: 2px;"&gt;&lt;I&gt;
&lt;P&gt;For OpenMP to run well on the larger numbers of packages and cores, increased attention to memory locality issues is needed.&lt;/P&gt;
&lt;/I&gt;&lt;/DIV&gt;
I hear you there. Is it reasonable to assume that each of the above configurations has its own, unique memory locality issues? Is it further reasonable to assume that these issues can be addressed through judicious setting of environment variables? Or, asked differently, is the choice of CPU configuration not especially important with respect to the OpenMP implementation AS LONG AS the memory locality issues are appropriately addressed? PS: the new forum software still needs work; I am replying to tim18, but it says I'm quoting the OP (which I could edit manually, but that's not the point).</description>
      <pubDate>Fri, 15 Jan 2010 20:15:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911867#M83599</guid>
      <dc:creator>peterklaver</dc:creator>
      <dc:date>2010-01-15T20:15:12Z</dc:date>
    </item>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911868#M83600</link>
      <description>&lt;P&gt;Thanks so much for your helpful insight. I have a question regarding your remark about the memory issues when using larger numbers of cores. Does it mean that the same code that runs OK on a quad core processor would need to be modified to run on a larger system with multiple processors?&lt;/P&gt;
&lt;P&gt;Thanks again,&lt;/P&gt;
&lt;P&gt;R&lt;/P&gt;</description>
      <pubDate>Sat, 16 Jan 2010 14:35:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911868#M83600</guid>
      <dc:creator>Reinaldo_Garcia</dc:creator>
      <dc:date>2010-01-16T14:35:29Z</dc:date>
    </item>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911869#M83601</link>
      <description>&lt;P&gt;If a great deal of foresight has been given to organization, or the application is simple enough, it's certainly possible that an OpenMP application designed for a quad-core CPU will scale automatically, even to four 8-core CPUs, given appropriate settings of environment variables (KMP_AFFINITY for Intel OpenMP).&lt;/P&gt;
&lt;P&gt;On the other hand, an application which works well on a single multi-core CPU in spite of false sharing, or even latent race conditions, is likely to be a disaster already on a dual CPU.&lt;/P&gt;
&lt;P&gt;In an example I discussed earlier, simple use of OpenMP schedule(guided) worked well to balance work between threads only up to 8 threads, but it was possible to scale effectively to at least 24 cores by writing in specific load balancing to keep each chunk local to the same core (and memory bank).&lt;/P&gt;</description>
      <pubDate>Sat, 16 Jan 2010 15:12:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911869#M83601</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2010-01-16T15:12:48Z</dc:date>
    </item>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911870#M83602</link>
      <description>&lt;P&gt;Reinaldo,&lt;/P&gt;
&lt;P&gt;Much also depends on the characteristics of your applications when choosing a large system. Some application domains have very little locality and large data sets, so memory bandwidth is crucial. For these applications, Xeon processors with 3-channel DDR3 would be best to maximize memory bandwidth. Other applications have inherently high data locality, so for these the memory system is less important (2-channel DDR3 is fine), but the size of the on-chip caches (especially the largest level) should be maximized. Hope this helps a bit!&lt;/P&gt;
&lt;P&gt;- Grant&lt;/P&gt;</description>
      <pubDate>Fri, 12 Feb 2010 16:50:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911870#M83602</guid>
      <dc:creator>Grant_H_Intel</dc:creator>
      <dc:date>2010-02-12T16:50:16Z</dc:date>
    </item>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911871#M83603</link>
      <description>&lt;P&gt;Our applications are computationally and memory-intensive finite element models. Runs typically take from several hours to several days on an i7-920 processor. Parallelizing the code with OpenMP has made a huge difference on these processors, but I was wondering how much better performance we could expect using 2 Xeon processors instead of one i7.&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;
&lt;P&gt;R//G&lt;/P&gt;</description>
      <pubDate>Wed, 03 Mar 2010 03:02:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911871#M83603</guid>
      <dc:creator>Reinaldo_Garcia</dc:creator>
      <dc:date>2010-03-03T03:02:53Z</dc:date>
    </item>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911872#M83604</link>
      <description>Results from &lt;A href="http://topcrunch.org" target="_blank"&gt;http://topcrunch.org&lt;/A&gt; typically show Xeon 5560 (dual quad-core) platforms giving 70% more performance than Core i7. On some of these applications, the 6-core CPUs give an additional 25%. Note that most results on that site are for MPI applications, which may gain more than OpenMP from additional sockets.</description>
      <pubDate>Wed, 03 Mar 2010 03:31:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911872#M83604</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2010-03-03T03:31:44Z</dc:date>
    </item>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911873#M83605</link>
      <description>&lt;P&gt;Reinaldo,&lt;/P&gt;
&lt;P&gt;If your FE application has memory bandwidth issues (as opposed to memory capacity issues), and if you wish to address this with NUMA node locality, then you must invest some time in partitioning the data so that parts are allocated in a distributed manner across the NUMA nodes (and processed by HT threads from within those respective nodes). Using a Structure of Arrays layout can help too, as it can take better advantage of the SSE capabilities of the processor.&lt;/P&gt;
&lt;P&gt;For example, if your code is currently set up with objects (nodes) containing property variables, say Vec3 :: pos, then for a Structure of Arrays layout you do not allocate objects (nodes); rather, you allocate arrays of your property variables (pos, vel, acc, force, ...).&lt;/P&gt;
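&lt;P&gt;As a rough illustration of that layout change, here is a minimal sketch in Python (used only for brevity; it is not taken from any code in this thread). The Node class, the to_structure_of_arrays helper, and the per-component array names such as pos_x are hypothetical, and splitting each vector into separate x, y, z arrays is an assumption about how the data might be arranged.&lt;/P&gt;

```python
from array import array


class Node:
    """Hypothetical array-of-structures element: one object per FE node."""

    def __init__(self, pos, vel):
        self.pos = pos  # (x, y, z) tuple
        self.vel = vel  # (x, y, z) tuple


def to_structure_of_arrays(nodes):
    """Return one contiguous array of doubles per property component."""
    soa = {}
    for name in ("pos", "vel"):
        for axis, label in enumerate("xyz"):
            values = (getattr(n, name)[axis] for n in nodes)
            soa[name + "_" + label] = array("d", values)
    return soa


nodes = [
    Node((1.0, 2.0, 3.0), (0.1, 0.2, 0.3)),
    Node((4.0, 5.0, 6.0), (0.4, 0.5, 0.6)),
]
soa = to_structure_of_arrays(nodes)

# The x coordinates of all nodes are now stored next to each other.
print(list(soa["pos_x"]))  # prints [1.0, 4.0]
```

&lt;P&gt;The point of the sketch is only that a loop over pos_x touches consecutive memory, which is the access pattern that vectorization tends to favor.&lt;/P&gt;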
&lt;P&gt;However, on a NUMA platform with NUMA-aware allocation, you would pre-slice the property variable arrays by the number of NUMA nodes (2, 4, 8, ...) and allocate each slice within its corresponding node. Then construct your processing loops such that each slice has preferential (or exclusive) processing by threads within that NUMA node.&lt;/P&gt;
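&lt;P&gt;The index bookkeeping behind such slicing can be sketched as follows, again in Python and again only as an illustration under assumptions. The names slice_bounds and numa_plan are hypothetical, the sketch assumes equal-sized contiguous slices, and it only computes which index range each (NUMA node, thread) pair would own; it does not perform any NUMA-aware allocation or thread pinning.&lt;/P&gt;

```python
def slice_bounds(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous half-open ranges.

    The first (n_items % n_parts) ranges receive one extra item.
    """
    base, extra = divmod(n_items, n_parts)
    return [
        (p * base + min(p, extra), (p + 1) * base + min(p + 1, extra))
        for p in range(n_parts)
    ]


def numa_plan(n_items, n_numa_nodes, threads_per_node):
    """Map each (node, local thread) pair to the index range it processes."""
    plan = {}
    for node, (lo, hi) in enumerate(slice_bounds(n_items, n_numa_nodes)):
        # Sub-divide the node's slice among the threads bound to that node.
        for thread, (a, b) in enumerate(slice_bounds(hi - lo, threads_per_node)):
            plan[(node, thread)] = (lo + a, lo + b)
    return plan


plan = numa_plan(n_items=10, n_numa_nodes=2, threads_per_node=2)
for key in sorted(plan):
    print(key, plan[key])
```

&lt;P&gt;Because the mapping is fixed, every major loop can reuse the same plan, so a given array section is always processed from the same node.&lt;/P&gt;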
&lt;P&gt;This requires more programming work, but once you do this for one major loop, it becomes close to a cut-and-paste operation to set up your next major loop.&lt;/P&gt;
&lt;P&gt;Some of my FE runs (Space Elevator simulations) would take weeks to complete.&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Wed, 03 Mar 2010 14:10:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911873#M83605</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2010-03-03T14:10:40Z</dc:date>
    </item>
    <item>
      <title>Best hardware for OpenMP application</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911874#M83606</link>
      <description>&lt;P&gt;Yes, NUMA locality is among the factors which require attention for OpenMP to give full effectiveness on Xeon 55xx or Opteron platforms.&lt;/P&gt;
&lt;P&gt;If the data are laid out so as to promote SSE vectorization, and if the first touch (e.g. when arrays are initialized) can be done by an OpenMP parallel loop of the same structure as the working loops, so that each section of the array is always accessed from the same processor and sharing of cache lines between processors is minimized, that should take care of it. Needless to say, this may be difficult to arrange in an existing application which was not designed for locality.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Mar 2010 17:47:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Best-hardware-for-OpenMP-application/m-p/911874#M83606</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2010-03-03T17:47:06Z</dc:date>
    </item>
  </channel>
</rss>

