<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Interesting results.... in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112479#M6050</link>
    <description>&lt;P&gt;Interesting results....&lt;/P&gt;

&lt;P&gt;Notice that in the win32 case the output of KMP_AFFINITY includes "Initial OS proc set respected", while for the 64-bit version the KMP_AFFINITY message is "Initial OS proc set not respected".&amp;nbsp;&amp;nbsp;&amp;nbsp; I don't know why there should be a difference, but you can override the behavior in the win32 case by adding the "norespect" option to KMP_AFFINITY:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;KMP_AFFINITY=verbose,scatter,norespect&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;This may not help (there may be some other reason that win32 only sees one socket), but it is a quick and easy test....&lt;/P&gt;</description>
    <pubDate>Tue, 13 Dec 2016 23:24:02 GMT</pubDate>
    <dc:creator>McCalpinJohn</dc:creator>
    <dc:date>2016-12-13T23:24:02Z</dc:date>
    <item>
      <title>openMP difference 32 vs 64-bit</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112474#M6045</link>
      <description>&lt;P&gt;Dear all,&lt;/P&gt;

&lt;P&gt;Not sure whether this is the right sub-forum but here is my problem:&lt;/P&gt;

&lt;P&gt;I am building a Fortran program with the Intel Fortran compiler (w_comp_lib_2016.1.146) on my Windows 7 machine with Visual Studio 2013, in both a 32-bit and a 64-bit version. The program contains OpenMP directives to parallelize the most intensive do loop:&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; !$OMP PARALLEL&amp;nbsp; private(..)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; !$OMP DO&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;do ....&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; do ..&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; ...&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; enddo&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;enddo&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; !$OMP END DO&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; !$OMP END PARALLEL&lt;/P&gt;

&lt;P&gt;I am running the program on a high performance computer (Windows Server 2012R2) containing 2 nodes with 20 CPUs each (Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz). Whereas the 64-bit program uses all available CPUs on both nodes (as observed in the resource monitor), the 32-bit application uses only one node and is thus significantly slower. The RAM usage is approximately 50Mb.&lt;/P&gt;

&lt;P&gt;Any clue on what is causing the 32-bit application to use only 1 node? I could not find an answer after searching the forum.&lt;/P&gt;

&lt;P&gt;Best regards&lt;/P&gt;</description>
      <pubDate>Tue, 13 Dec 2016 10:39:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112474#M6045</guid>
      <dc:creator>Windyman</dc:creator>
      <dc:date>2016-12-13T10:39:13Z</dc:date>
    </item>
    <item>
      <title>As you are using ifort, the</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112475#M6046</link>
      <description>&lt;P&gt;As you are using ifort, the Windows Fortran forum may be more useful.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;It appears you have missed a few points in your forum searches.&lt;/P&gt;

&lt;P&gt;1) "all CPUs" isn't a very suitable term if you mean all logical CPUs.&amp;nbsp; Most Fortran applications perform better with OMP_NUM_THREADS set to the number of cores, and 'SET OMP_PLACES=cores'.&amp;nbsp; You seem to imply you haven't tried these settings.&amp;nbsp; If the settings aren't acceptable, you should try disabling HyperThreading.&lt;/P&gt;

&lt;P&gt;2) The default value of OMP_STACKSIZE is 2MB for 32-bit mode and 4MB for 64-bit.&amp;nbsp; I don't know of a way in which the default OMP_NUM_THREADS would be adjusted accordingly.&amp;nbsp; It's easy to run out of stack space when you run so many threads.&amp;nbsp; If you increase OMP_STACKSIZE, you may also need to raise the link stack setting.&lt;/P&gt;

&lt;P&gt;3) (not directly related to your question) on account of the limited address space in 32-bit mode, ifort skips some of the optimizations which it performs at the same settings for 64-bit mode.&amp;nbsp; Some of these differences may show up when you set /Qopt-report:4 which you should have read about in your forum search.&lt;/P&gt;

&lt;P&gt;4) It seems unlikely that anyone would want to run in 32-bit mode on such a platform.&amp;nbsp; If you wish to pursue this question, you may need to explain your reasons.&lt;/P&gt;

&lt;P&gt;5) Your statement about 50Mb seems ambiguous.&amp;nbsp; If you do mean bits, that seems rather small.&amp;nbsp; How do you get this figure?&lt;/P&gt;</description>
      <pubDate>Tue, 13 Dec 2016 12:40:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112475#M6046</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-12-13T12:40:43Z</dc:date>
    </item>
    <item>
      <title>Dear Tim,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112476#M6047</link>
      <description>&lt;P&gt;Dear Tim,&lt;BR /&gt;
	Thanks a lot&amp;nbsp;for your comments! I have tried your suggestions:&lt;/P&gt;

&lt;P&gt;1) I had not set OMP_NUM_THREADS because I assumed it would default to the number of available cores. That works for the 64-bit build but not for the 32-bit one. Setting it to 40 (which is what the 64-bit build uses) does raise the thread count to 40 in the 32-bit build as well, but CPU usage on one of the two nodes remains very low, so performance does not improve. Setting OMP_PLACES to "cores" did not seem to change much.&lt;BR /&gt;
	2) Setting OMP_STACKSIZE to 16M unfortunately changes nothing (I verified the value of the environment variable with an echo command).&lt;BR /&gt;
	3) Activating the /Qopt-report:4 option during linking did not reveal any differences between 32-bit and 64-bit for the OpenMP part. For both builds "DEFINED REGION WAS PARALLELIZED" is reported.&lt;BR /&gt;
	4) I want to run in 32-bit because in the future I will have to compile the program as a library (DLL) coupled to an external program, which is unfortunately only available in 32-bit.&lt;BR /&gt;
	5) Sorry, I meant MB. Task Manager shows a memory usage of approx. 50MB.&lt;/P&gt;

&lt;P&gt;So I am still puzzled. Should I disable hyper-threading? I also just noticed that the environment variables include NUMBER_OF_PROCESSORS, which is set to only 20 (I tried changing it to 40, but the change did not take effect, as verified by an echo statement).&lt;/P&gt;

&lt;P&gt;Best regards&lt;/P&gt;

&lt;P&gt;PS I could not find a way to move my post to the Fortran forum.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Dec 2016 14:34:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112476#M6047</guid>
      <dc:creator>Windyman</dc:creator>
      <dc:date>2016-12-13T14:34:34Z</dc:date>
    </item>
    <item>
      <title>The "OMP_PLACES" environment</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112477#M6048</link>
      <description>&lt;P&gt;The "OMP_PLACES" environment variable is a relatively recent addition.&amp;nbsp; You should be able to control thread placement using the legacy "KMP_AFFINITY" environment variable.&amp;nbsp;&amp;nbsp; There are several different ways to use KMP_AFFINITY to get the threads distributed across the two sockets, but I recommend a simple start:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;KMP_AFFINITY=verbose,scatter&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;With the "verbose" option, when you run the job and it reaches its first parallel region, the OpenMP runtime will print out a full listing of where all of the logical processors are located (i.e., the socket, core, and thread context), followed by a full listing of the OpenMP threads and which logical processor(s) they are bound to.&lt;/P&gt;

&lt;P&gt;With the "scatter" option, the threads will be spread as far apart as possible -- alternating between sockets, then interleaving across cores, then repeating the pattern but using the second thread context on each core.&amp;nbsp; E.g., for OMP_NUM_THREADS=20, you should get one thread on each core in each socket.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;The primary alternative to "scatter" is "compact", which reverses all three levels of the interleaving -- first alternate thread contexts, then cores, and finally sockets.&amp;nbsp; In this case for OMP_NUM_THREADS=20, you should get one thread on each logical processor in the first socket, and nothing in the second socket.&lt;/P&gt;
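
&lt;P&gt;As a rough illustration of the two orderings described above, here is a minimal Python sketch. It is a simplified model of the interleaving, not the actual algorithm used by the OpenMP runtime; the function names and the default topology of 2 sockets x 10 cores x 2 thread contexts are assumptions chosen to match this machine.&lt;/P&gt;

```python
def scatter_placement(num_threads, sockets=2, cores=10, contexts=2):
    """Simplified model of "scatter": vary the socket fastest, then the core,
    then the thread context. Not the real runtime algorithm."""
    placement = []
    for i in range(num_threads):
        socket = i % sockets
        core = (i // sockets) % cores
        context = (i // (sockets * cores)) % contexts
        placement.append((socket, core, context))
    return placement


def compact_placement(num_threads, sockets=2, cores=10, contexts=2):
    """Simplified model of "compact": vary the thread context fastest, then
    the core, then the socket."""
    placement = []
    for i in range(num_threads):
        context = i % contexts
        core = (i // contexts) % cores
        socket = (i // (contexts * cores)) % sockets
        placement.append((socket, core, context))
    return placement


if __name__ == "__main__":
    for name, fn in (("scatter", scatter_placement), ("compact", compact_placement)):
        used_sockets = sorted({s for s, _, _ in fn(20)})
        print(name, "with 20 threads uses sockets", used_sockets)
```

&lt;P&gt;In this model, 20 threads under scatter land on one thread context of every core in both sockets, while 20 threads under compact fill both thread contexts of the ten cores in the first socket only.&lt;/P&gt;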

&lt;P&gt;Once you are satisfied with the behavior you can remove the "verbose" clause.&lt;/P&gt;

&lt;P&gt;There are lots of other approaches, but the "verbose" option to KMP_AFFINITY is the best way to be sure what the runtime library is doing.&lt;/P&gt;
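
&lt;P&gt;If you would rather not change the environment system-wide, the variables can also be set for a single run. A minimal Python sketch of that idea follows; only the KMP_AFFINITY and OMP_NUM_THREADS names come from this thread, while the function names and the "myprog.exe" placeholder are hypothetical.&lt;/P&gt;

```python
import os
import subprocess


def build_env(affinity="verbose,scatter", num_threads=None):
    """Return a copy of the current environment with the OpenMP settings added."""
    env = dict(os.environ)
    env["KMP_AFFINITY"] = affinity
    if num_threads is not None:
        env["OMP_NUM_THREADS"] = str(num_threads)
    return env


def run_with_affinity(command, affinity="verbose,scatter", num_threads=None):
    """Run `command` (a list of strings) with the settings applied to that run only."""
    env = build_env(affinity, num_threads)
    return subprocess.run(command, env=env, check=False)


if __name__ == "__main__":
    env = build_env(num_threads=20)
    print("KMP_AFFINITY =", env["KMP_AFFINITY"])
    print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])
    # To launch a program with these settings, e.g. (placeholder name):
    # run_with_affinity(["myprog.exe"], num_threads=20)
```

&lt;P&gt;Because the settings live in a copy of the environment, the shell and other programs are left untouched, which makes it easy to try several KMP_AFFINITY strings one after another.&lt;/P&gt;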

</description>
      <pubDate>Tue, 13 Dec 2016 20:41:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112477#M6048</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-12-13T20:41:08Z</dc:date>
    </item>
    <item>
      <title>Dear John,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112478#M6049</link>
      <description>&lt;P&gt;Dear John,&lt;BR /&gt;
	Thanks very much for your suggestion. I have set the KMP_AFFINITY variable as suggested. Unfortunately, the scatter option did not give the desired result. However, the verbose clause did let me diagnose the situation:&lt;/P&gt;

&lt;P&gt;For win32:&lt;BR /&gt;
	OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.&lt;BR /&gt;
	OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info&lt;BR /&gt;
	OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}&lt;BR /&gt;
	OMP: Info #156: KMP_AFFINITY: 20 available OS procs&lt;BR /&gt;
	OMP: Info #157: KMP_AFFINITY: Uniform topology&lt;BR /&gt;
	OMP: Info #179: KMP_AFFINITY: 1 packages x 10 cores/pkg x 2 threads/core (10 tot&lt;/P&gt;

&lt;P&gt;For x64:&lt;BR /&gt;
	OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.&lt;BR /&gt;
	OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info&lt;BR /&gt;
	OMP: Info #155: KMP_AFFINITY: Initial OS proc set not respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}&lt;BR /&gt;
	OMP: Info #156: KMP_AFFINITY: 40 available OS procs&lt;BR /&gt;
	OMP: Info #157: KMP_AFFINITY: Uniform topology&lt;BR /&gt;
	OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 2 threads/core (20 tot&lt;/P&gt;
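
&lt;P&gt;The two listings can also be compared mechanically. Below is a minimal Python sketch that pulls the processor count and topology out of the Info #156 and #179 lines. The function name and the regular expressions are my own, written against the message text shown above and not against any documented format.&lt;/P&gt;

```python
import re

PROCS_RE = re.compile(r"KMP_AFFINITY: (\d+) available OS procs")
TOPO_RE = re.compile(
    r"KMP_AFFINITY: (\d+) packages x (\d+) cores/pkg x (\d+) threads/core"
)


def summarize(verbose_output):
    """Extract the OS proc count and (packages, cores, threads) from the text."""
    summary = {}
    procs = PROCS_RE.search(verbose_output)
    if procs:
        summary["os_procs"] = int(procs.group(1))
    topo = TOPO_RE.search(verbose_output)
    if topo:
        summary["packages"] = int(topo.group(1))
        summary["cores_per_pkg"] = int(topo.group(2))
        summary["threads_per_core"] = int(topo.group(3))
    return summary


# The relevant lines copied from the two listings above.
WIN32 = """OMP: Info #156: KMP_AFFINITY: 20 available OS procs
OMP: Info #179: KMP_AFFINITY: 1 packages x 10 cores/pkg x 2 threads/core"""

X64 = """OMP: Info #156: KMP_AFFINITY: 40 available OS procs
OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 2 threads/core"""

if __name__ == "__main__":
    print("win32:", summarize(WIN32))
    print("x64:  ", summarize(X64))
```

&lt;P&gt;For the listings above this gives 20 OS procs in 1 package for win32, against 40 OS procs in 2 packages for x64.&lt;/P&gt;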

&lt;P&gt;I suspect that for win32 the 2nd package is simply not available. So my hypothesis is that, whatever I do, my program simply cannot access it in 32-bit. But possibly I can change a software (Windows?) setting elsewhere to get what I want (respecting the initial OS proc set, yes or no?). Or is that a question for a different forum again? Suggestions welcome from you black-belters out there.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Dec 2016 23:08:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112478#M6049</guid>
      <dc:creator>Windyman</dc:creator>
      <dc:date>2016-12-13T23:08:07Z</dc:date>
    </item>
    <item>
      <title>Interesting results....</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112479#M6050</link>
      <description>&lt;P&gt;Interesting results....&lt;/P&gt;

&lt;P&gt;Notice that in the win32 case the output of KMP_AFFINITY includes "Initial OS proc set respected", while for the 64-bit version the KMP_AFFINITY message is "Initial OS proc set not respected".&amp;nbsp;&amp;nbsp;&amp;nbsp; I don't know why there should be a difference, but you can override the behavior in the win32 case by adding the "norespect" option to KMP_AFFINITY:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;KMP_AFFINITY=verbose,scatter,norespect&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;This may not help (there may be some other reason that win32 only sees one socket), but it is a quick and easy test....&lt;/P&gt;</description>
      <pubDate>Tue, 13 Dec 2016 23:24:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112479#M6050</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-12-13T23:24:02Z</dc:date>
    </item>
    <item>
      <title>This is the result for</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112480#M6051</link>
      <description>&lt;P&gt;This is the result for setting norespect:&lt;/P&gt;

&lt;P&gt;OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.&lt;BR /&gt;
	OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info&lt;BR /&gt;
	OMP: Info #155: KMP_AFFINITY: Initial OS proc set not respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}&lt;BR /&gt;
	OMP: Info #156: KMP_AFFINITY: 20 available OS procs&lt;BR /&gt;
	OMP: Info #157: KMP_AFFINITY: Uniform topology&lt;BR /&gt;
	OMP: Info #179: KMP_AFFINITY: 1 packages x 10 cores/pkg x 2 threads/core (10 total cores)&lt;/P&gt;

&lt;P&gt;Indeed, the proc set is now not respected, but this does not have the desired effect.&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 09:01:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112480#M6051</guid>
      <dc:creator>Windyman</dc:creator>
      <dc:date>2016-12-14T09:01:42Z</dc:date>
    </item>
    <item>
      <title>Definitely looks like a</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112481#M6052</link>
      <description>&lt;P&gt;Definitely looks like a Windows problem.&amp;nbsp; Hard to tell if it is a bug or a feature, but one would hope that an OS as recent as Windows Server 2012 would know how to handle multi-socket systems....&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 16:39:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/openMP-difference-32-vs-64-bit/m-p/1112481#M6052</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-12-14T16:39:56Z</dc:date>
    </item>
  </channel>
</rss>

