<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Pentium 4 Large Page questions in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929802#M2547</link>
    <description>Let's consider the following pseudo code. &lt;BR /&gt;&lt;BR /&gt;for (it=0;it&amp;lt; itermax; it++){&lt;BR /&gt;&lt;BR /&gt;for (i=1;i&amp;lt; loopcount;i++){&lt;BR /&gt;for (j=1;j&amp;lt; loopcount;j++){&lt;BR /&gt;for (k=1;k&amp;lt; loopcount;k++){&lt;BR /&gt;&lt;BR /&gt;    Ut[i][j][k] = factor * ( U[i+1][j][k]&lt;BR /&gt;        +U[i-1][j][k]&lt;BR /&gt;        +U[i][j+1][k]&lt;BR /&gt;        +U[i][j-1][k]&lt;BR /&gt;        +U[i][j][k+1]&lt;BR /&gt;        +U[i][j][k-1]&lt;BR /&gt;        -dh2 * F[i][j][k]);&lt;BR /&gt;    }&lt;BR /&gt;    }&lt;BR /&gt;    }&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;So the stride is one for every stream I need, and I&lt;BR /&gt;need 7 streams altogether. &lt;BR /&gt;In the actual update I access 5 different pages.&lt;BR /&gt;These 5 pages stay the same for roughly one&lt;BR /&gt;line of the 3D grid. There may be around 65536 lines.&lt;BR /&gt;In every new line I access 5 new pages.&lt;BR /&gt;This gets even worse if you apply loop&lt;BR /&gt;blocking techniques, where the data paths get more complicated&lt;BR /&gt;and the number of pages you access increases.&lt;BR /&gt;&lt;BR /&gt;The above numbers are only a guess. I have some &lt;BR /&gt;performance counter numbers for 2D where the TLB&lt;BR /&gt;misses decrease by a factor of around 800 when using&lt;BR /&gt;large pages.&lt;BR /&gt;&lt;BR /&gt;Also, I still want to show that large pages can give you &lt;BR /&gt;an advantage. I did preliminary experiments with an&lt;BR /&gt;in-cache vector triad. With 4k pages you can see&lt;BR /&gt;the performance smear out for sizes larger than 256k.&lt;BR /&gt;So the performance doesn't decrease sharply when you drop&lt;BR /&gt;out of cache, as you might expect, but degrades earlier.&lt;BR /&gt;This is clearly caused by the small D-TLB. &lt;BR /&gt;With large pages performance stays constant for the whole&lt;BR /&gt;cache size, unfortunately at a lower level.&lt;BR /&gt;&lt;BR /&gt;Anyway, thank you for your help.&lt;BR /&gt;&lt;BR /&gt;If you are interested I can give you the results of our&lt;BR /&gt;tests, this time backed by measurements, after we have &lt;BR /&gt;finished.&lt;BR /&gt;&lt;BR /&gt;Jan</description>
    <pubDate>Fri, 05 Nov 2004 03:56:45 GMT</pubDate>
    <dc:creator>moebiusband</dc:creator>
    <dc:date>2004-11-05T03:56:45Z</dc:date>
    <item>
      <title>Pentium 4 Large Page questions</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929797#M2542</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;we develop cache-aware iterative solvers.&lt;BR /&gt;This group of algorithms is mainly limited by &lt;BR /&gt;memory access. With 3D problems we face&lt;BR /&gt;an increase in data TLB misses. One solution&lt;BR /&gt;seemed to be the use of large (4M) pages.&lt;BR /&gt;&lt;BR /&gt;We noticed several strange effects after switching&lt;BR /&gt;to large pages. Bottom line: D-TLB misses and L2&lt;BR /&gt;DCM dropped, but the runtime increased by roughly &lt;BR /&gt;a factor of 2, so performance got worse.&lt;BR /&gt;&lt;BR /&gt;To pin down the problem I did two tests:&lt;BR /&gt;1) I have several assembler implementations of memcpy. &lt;BR /&gt;One uses software prefetching (mov into a register). The&lt;BR /&gt;version with "hand prefetching" showed exactly the same&lt;BR /&gt;performance for 4k and 4M pages, while the standard &lt;BR /&gt;version showed a large decrease.&lt;BR /&gt;&lt;BR /&gt;2) I checked the vector triad with an SSE2 assembler&lt;BR /&gt;implementation. So in this case prefetching should play&lt;BR /&gt;no role. With 4M pages I get roughly half the performance&lt;BR /&gt;I get with 4k pages.&lt;BR /&gt;&lt;BR /&gt;I have two questions:&lt;BR /&gt;&lt;BR /&gt;* Is hardware prefetching disabled for large pages?&lt;BR /&gt;&lt;BR /&gt;* Is there any issue with SSE2 instructions and large pages?&lt;BR /&gt;&lt;BR /&gt;Is there any other point I have overlooked?&lt;BR /&gt;&lt;BR /&gt;Thanks in advance for your help,&lt;BR /&gt;&lt;BR /&gt;Jan Treibig&lt;BR /&gt;&lt;BR /&gt;PS:&lt;BR /&gt;Just for completeness: &lt;BR /&gt;The codes are exactly the same for the different page sizes.&lt;BR /&gt;We use the mmap call to allocate memory &lt;BR /&gt;on a hugetlbfs on Linux. To use it, we override &lt;BR /&gt;malloc and LD_PRELOAD the implementing library.&lt;BR /&gt;&lt;BR /&gt;The operating system is Linux with a 2.6.5 kernel. &lt;BR /&gt;As all benchmarks are written in assembler, the compiler&lt;BR /&gt;is not an issue. &lt;BR /&gt;But these effects can also be seen with C code.</description>
      <pubDate>Wed, 03 Nov 2004 20:17:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929797#M2542</guid>
      <dc:creator>moebiusband</dc:creator>
      <dc:date>2004-11-03T20:17:36Z</dc:date>
    </item>
    <item>
      <title>Re: Pentium 4 Large Page questions</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929798#M2543</link>
      <description>I'll try to be brief, as my previous replies have been deleted.&lt;BR /&gt;I think that a page size of 16K or 64K would have a better chance of helping with the concerns you expressed.  A 4M page size appears to be attempting to improve performance only for large-stride access within arrays of over 10MB, at the expense of other factors. &lt;BR /&gt;You raise an interesting point about hardware prefetch.  If the P4 hardware prefetch reach is limited to 4K, maybe the hoped-for advantage of larger pages is defeated.&lt;BR /&gt;I suspect there will be more need for TLB miss mitigation in the 64-bit OS, and with larger cache.  Wouldn't those systems be more interesting in the future for solving problems such as the ones you mention?&lt;P&gt;Message Edited by tim18 on &lt;SPAN class="date_text"&gt;11-04-2004&lt;/SPAN&gt; &lt;SPAN class="time_text"&gt;04:51 AM&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Nov 2004 20:28:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929798#M2543</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2004-11-04T20:28:36Z</dc:date>
    </item>
    <item>
      <title>Re: Pentium 4 Large Page questions</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929799#M2544</link>
      <description>Hi,&lt;BR /&gt;I left out an important point:&lt;BR /&gt;For the vector triad, prefetching plays no role because &lt;BR /&gt;I adjusted the size so that it runs completely in cache.&lt;BR /&gt;&lt;BR /&gt;About the system:&lt;BR /&gt;It is a Pentium 4 Northwood 2.8 GHz&lt;BR /&gt;on an i865 board with dual-channel DDR400 RAM.</description>
      <pubDate>Thu, 04 Nov 2004 20:45:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929799#M2544</guid>
      <dc:creator>moebiusband</dc:creator>
      <dc:date>2004-11-04T20:45:37Z</dc:date>
    </item>
    <item>
      <title>Re: Pentium 4 Large Page questions</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929800#M2545</link>
      <description>Hi,&lt;BR /&gt;thanks for your answer.&lt;BR /&gt;&lt;BR /&gt;We have large strides. Consider a typical 3D&lt;BR /&gt;simulation. Typical sizes of the 3D arrays used are&lt;BR /&gt;e.g. 256x256x256 points, which is around 260MB for &lt;BR /&gt;each array. If you use stencil-based codes on &lt;BR /&gt;regular grids, you access &lt;BR /&gt;[i(+-1)][j(+-1)][k(+-1)]. So for every point update &lt;BR /&gt;you access five different pages. And you update&lt;BR /&gt;each point several times. As you may agree, this&lt;BR /&gt;produces lots of TLB misses. &lt;BR /&gt;&lt;BR /&gt;You speak of the expense of other factors. As I use &lt;BR /&gt;large pages only for the large arrays, memory&lt;BR /&gt;fragmentation should be no issue, especially&lt;BR /&gt;with multiple page size support, as present on Linux.&lt;BR /&gt;Is there any other factor I am missing?&lt;BR /&gt;&lt;BR /&gt;So using large pages is sensible and should increase &lt;BR /&gt;performance.&lt;BR /&gt;&lt;BR /&gt;You didn't answer my questions. The behaviour of the P4&lt;BR /&gt;CPU with large pages is not documented. There are&lt;BR /&gt;some words about 4k restrictions in the optimization&lt;BR /&gt;handbook, but they are never explicitly discussed with&lt;BR /&gt;regard to large pages. Do you think the 4k page&lt;BR /&gt;boundary issue with the prefetcher causes the problem?&lt;BR /&gt;It would be interesting to try this with the Prescott, as&lt;BR /&gt;its prefetcher is no longer limited to the 4k boundary.&lt;BR /&gt;&lt;BR /&gt;To answer your last question:&lt;BR /&gt;TLB misses are a problem in scientific applications now. &lt;BR /&gt;So this is independent of 64-bit. Whether the 64-bit address space&lt;BR /&gt;is necessary depends on the problem, but of course there &lt;BR /&gt;are many problems that are limited by the amount of &lt;BR /&gt;memory available.&lt;BR /&gt;&lt;BR /&gt;Jan Treibig&lt;P&gt;Message Edited by moebiusband on &lt;SPAN class="date_text"&gt;11-04-2004&lt;/SPAN&gt; &lt;SPAN class="time_text"&gt;06:01 AM&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Nov 2004 20:46:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929800#M2545</guid>
      <dc:creator>moebiusband</dc:creator>
      <dc:date>2004-11-04T20:46:55Z</dc:date>
    </item>
    <item>
      <title>Re: Pentium 4 Large Page questions</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929801#M2546</link>
      <description>Your stencil-based code should have no difficulty with the DTLB if your inner loop is over the stride-1 subscript, as it must be for any opportunity of vectorization.  There should be no problem with having 5 pages active, if the same 5 pages are in use over a large number of inner loop iterations. &lt;BR /&gt;If you must loop over the largest-stride subscript, you will certainly have problems with DTLB misses, at any reasonable page size.  For looping over the middle subscript, I would not be surprised to see trouble, but would hope something could be done with page size.&lt;BR /&gt;I agree that we have important unanswered questions.  I have some evidence that it may not be independent of the 64-bit OS; it may become a more important problem there.  I don't disagree that it may already be a problem in the 32-bit OS.</description>
      <pubDate>Fri, 05 Nov 2004 01:17:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929801#M2546</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2004-11-05T01:17:53Z</dc:date>
    </item>
    <item>
      <title>Re: Pentium 4 Large Page questions</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929802#M2547</link>
      <description>Let's consider the following pseudo code. &lt;BR /&gt;&lt;BR /&gt;for (it=0;it&amp;lt; itermax; it++){&lt;BR /&gt;&lt;BR /&gt;for (i=1;i&amp;lt; loopcount;i++){&lt;BR /&gt;for (j=1;j&amp;lt; loopcount;j++){&lt;BR /&gt;for (k=1;k&amp;lt; loopcount;k++){&lt;BR /&gt;&lt;BR /&gt;    Ut[i][j][k] = factor * ( U[i+1][j][k]&lt;BR /&gt;        +U[i-1][j][k]&lt;BR /&gt;        +U[i][j+1][k]&lt;BR /&gt;        +U[i][j-1][k]&lt;BR /&gt;        +U[i][j][k+1]&lt;BR /&gt;        +U[i][j][k-1]&lt;BR /&gt;        -dh2 * F[i][j][k]);&lt;BR /&gt;    }&lt;BR /&gt;    }&lt;BR /&gt;    }&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;So the stride is one for every stream I need, and I&lt;BR /&gt;need 7 streams altogether. &lt;BR /&gt;In the actual update I access 5 different pages.&lt;BR /&gt;These 5 pages stay the same for roughly one&lt;BR /&gt;line of the 3D grid. There may be around 65536 lines.&lt;BR /&gt;In every new line I access 5 new pages.&lt;BR /&gt;This gets even worse if you apply loop&lt;BR /&gt;blocking techniques, where the data paths get more complicated&lt;BR /&gt;and the number of pages you access increases.&lt;BR /&gt;&lt;BR /&gt;The above numbers are only a guess. I have some &lt;BR /&gt;performance counter numbers for 2D where the TLB&lt;BR /&gt;misses decrease by a factor of around 800 when using&lt;BR /&gt;large pages.&lt;BR /&gt;&lt;BR /&gt;Also, I still want to show that large pages can give you &lt;BR /&gt;an advantage. I did preliminary experiments with an&lt;BR /&gt;in-cache vector triad. With 4k pages you can see&lt;BR /&gt;the performance smear out for sizes larger than 256k.&lt;BR /&gt;So the performance doesn't decrease sharply when you drop&lt;BR /&gt;out of cache, as you might expect, but degrades earlier.&lt;BR /&gt;This is clearly caused by the small D-TLB. &lt;BR /&gt;With large pages performance stays constant for the whole&lt;BR /&gt;cache size, unfortunately at a lower level.&lt;BR /&gt;&lt;BR /&gt;Anyway, thank you for your help.&lt;BR /&gt;&lt;BR /&gt;If you are interested I can give you the results of our&lt;BR /&gt;tests, this time backed by measurements, after we have &lt;BR /&gt;finished.&lt;BR /&gt;&lt;BR /&gt;Jan</description>
      <pubDate>Fri, 05 Nov 2004 03:56:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Pentium-4-Large-Page-questions/m-p/929802#M2547</guid>
      <dc:creator>moebiusband</dc:creator>
      <dc:date>2004-11-05T03:56:45Z</dc:date>
    </item>
  </channel>
</rss>

