<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: SetThreadIdealProcessor() or SetThreadAffinityMask() or bot in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907772#M4502</link>
    <description>&lt;P&gt;Jarnell,&lt;/P&gt;
&lt;P&gt;Just curious here. On your Q6600, on WinXP 32-bit, when you run a loop calling GetNumaProcessorNode( ) with all 4 processor numbers (0 through 3), what node numbers are reported? Of interest to me is whether you see 1 node, 2 nodes, or 3 nodes.&lt;/P&gt;
&lt;P&gt;If the underlying NUMA support is not present on WinXP 32-bit (as implied by the documentation), then you would expect all processors to be reported as being on the same node. Not having that configuration here, I can only speculate.&lt;/P&gt;
&lt;P&gt;You should note that the bit position of a physical processor core in the affinity mask is not guaranteed to be the same across all operating systems or revisions thereof, so use of system function calls is recommended.&lt;/P&gt;
&lt;P&gt;However, if the system function calls do not offer this information (e.g. WinXP Home), then you could write a small startup function that probes memory and makes the associations.&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 24 Sep 2007 12:57:31 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2007-09-24T12:57:31Z</dc:date>
    <item>
      <title>SetThreadIdealProcessor() or SetThreadAffinityMask() or both?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907767#M4497</link>
      <description>&lt;P&gt;I am writing an application to run on a Quad Core Q6600 on Windows XP.&lt;/P&gt;
&lt;P&gt;I have four threads which I would like to force to run on one core each - should I use SetThreadIdealProcessor() or SetThreadAffinityMask() or both?&lt;/P&gt;
&lt;P&gt;With SetThreadIdealProcessor( ) does the parameter &lt;EM&gt;dwIdealProcessor&lt;/EM&gt; start at zero?&lt;/P&gt;
&lt;P&gt;Also, on the Q6600, is it correct that cores 0 and 2 share the same L2 cache, and cores 1 and 3 share a separate cache?&lt;/P&gt;
&lt;P&gt;Does this numbering correspond to the numbering used in the Windows XP functions?&lt;/P&gt;
&lt;P&gt;Sorry for all these questions!&lt;/P&gt;</description>
      <pubDate>Sun, 23 Sep 2007 15:08:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907767#M4497</guid>
      <dc:creator>jarnell</dc:creator>
      <dc:date>2007-09-23T15:08:16Z</dc:date>
    </item>
    <item>
      <title>Re: SetThreadIdealProcessor() or SetThreadAffinityMask() or bot</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907768#M4498</link>
      <description>&lt;P&gt;Jarnell,&lt;/P&gt;
&lt;P&gt;SetThreadIdealProcessor takes a zero-based processor number.&lt;/P&gt;
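&lt;P&gt;A minimal sketch of how a zero-based processor number relates to the two calls being asked about: SetThreadIdealProcessor takes the number itself, while SetThreadAffinityMask takes a bit mask in which bit n stands for processor n. The helper names single_cpu_mask and pin_current_thread are made up for illustration, and the Windows calls are only compiled when _WIN32 is defined.&lt;/P&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: build an affinity mask with only bit `cpu` set.
// Processor numbers are zero-based, so processor 0 maps to mask 0x1.
inline std::uint64_t single_cpu_mask(unsigned cpu) {
    assert(cpu < 64);  // one bit per logical processor in a 64-bit mask
    return std::uint64_t{1} << cpu;
}

#ifdef _WIN32
// Hypothetical helper: hint the scheduler, then restrict the calling
// thread to one processor. Returns false if either call fails.
// Note: on 32-bit Windows the mask type holds only 32 bits.
inline bool pin_current_thread(unsigned cpu) {
    HANDLE self = GetCurrentThread();
    // Scheduling hint only: the thread may still run on other processors.
    if (SetThreadIdealProcessor(self, cpu) == static_cast<DWORD>(-1)) {
        return false;
    }
    // Hard restriction: returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(
               self, static_cast<DWORD_PTR>(single_cpu_mask(cpu))) != 0;
}
#endif
```
]]>
&lt;P&gt;With four threads on a quad core, thread i would call pin_current_thread(i) when it starts, which corresponds to masks 0x1, 0x2, 0x4 and 0x8.&lt;/P&gt;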
&lt;P&gt;The Intel site has a white paper on how to determine cache associations. It is not good practice to assume cache sharing is always 0/2 and 1/3, or 0/1 and 2/3, as processor designs do change. The better practice is to write or obtain code that makes this determination at program startup time.&lt;/P&gt;
&lt;P&gt;See MSDN articles regarding &lt;STRONG&gt;GetLogicalProcessorInformation&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;From MSDN:&lt;/P&gt;
&lt;P&gt;"GetLogicalProcessorInformation can be used to get information about the relationship between logical processors in the system, including: &lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The logical processors that are part of a &lt;A&gt;NUMA&lt;/A&gt; node. 
&lt;/LI&gt;&lt;LI&gt;The logical processors that share resources. An example of this type of resource sharing would be hyperthreading scenarios. &lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;Your application can use this information when affinitizing your threads and processes to take best advantage of the hardware properties of the platform, or to determine the number of logical and physical processors for licensing purposes.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Each of the SYSTEM_LOGICAL_PROCESSOR_INFORMATION structures returned in the buffer contains the following: 
&lt;/P&gt;&lt;P&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A logical processor affinity mask, which indicates the logical processors that the information in the structure applies to. 
&lt;/LI&gt;&lt;LI&gt;A logical processor mask of type &lt;A&gt;LOGICAL_PROCESSOR_RELATIONSHIP&lt;/A&gt;, which indicates the relationship between the logical processors in the mask. Applications calling this function must be prepared to handle additional indicator values in the future. &lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;Note that the order in which the structures are returned in the buffer may change between calls to this function.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The size of the SYSTEM_LOGICAL_PROCESSOR_INFORMATION structure varies between processor architectures and versions of Windows. For this reason, applications should first call this function to obtain the required buffer size, then dynamically allocate memory for the buffer."&lt;/P&gt;
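&lt;P&gt;The two-call pattern described above could look roughly like the following sketch. It assumes a Windows version that provides GetLogicalProcessorInformation; the helper names processors_in_mask and l2_cache_masks are made up for illustration, and only the mask decoding is portable.&lt;/P&gt;
<![CDATA[
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: list the zero-based processor numbers set in a mask.
inline std::vector<unsigned> processors_in_mask(std::uint64_t mask) {
    std::vector<unsigned> cpus;
    for (unsigned bit = 0; mask != 0; ++bit, mask >>= 1) {
        if (mask & 1u) cpus.push_back(bit);
    }
    return cpus;
}

#ifdef _WIN32
// Hypothetical helper: one processor mask per L2 cache.
// Returns an empty vector if the query fails.
inline std::vector<std::uint64_t> l2_cache_masks() {
    std::vector<std::uint64_t> masks;
    DWORD bytes = 0;
    // First call: expected to fail and report the required buffer size.
    if (GetLogicalProcessorInformation(nullptr, &bytes) ||
        GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
        return masks;
    }
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    // Second call: fill the dynamically sized buffer.
    if (!GetLogicalProcessorInformation(info.data(), &bytes)) return masks;
    const std::size_t count =
        bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
    for (std::size_t i = 0; i < count && i < info.size(); ++i) {
        // Relationship values this sketch does not know are skipped.
        if (info[i].Relationship == RelationCache &&
            info[i].Cache.Level == 2) {
            masks.push_back(info[i].ProcessorMask);
        }
    }
    return masks;
}
#endif
```
]]>
&lt;P&gt;Each mask returned by l2_cache_masks() can be passed to processors_in_mask() to see which processor numbers share that cache; a mask of 0x05, for example, decodes to processors 0 and 2.&lt;/P&gt;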
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sun, 23 Sep 2007 16:04:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907768#M4498</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-09-23T16:04:42Z</dc:date>
    </item>
    <item>
      <title>Re: SetThreadIdealProcessor() or SetThreadAffinityMask() or bot</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907769#M4499</link>
      <description>&lt;P&gt;Thanks Jim, I shall have a look at this.&lt;/P&gt;
&lt;P&gt;I assume "hyperthreading" refers to the older P4 architecture, not the Q6600 Quad Core architecture.&lt;/P&gt;
&lt;P&gt;Also, I would assume that cache sharing is of great significance for something like the Q6600, where there are two independent L2 caches. Presumably I need to avoid having these fight over the same chunks of memory.&lt;/P&gt;
&lt;P&gt;I also have a Core 2 Duo E6600 machine. At the moment I'm developing an application which I'm hoping to optimise for the Core 2 Duo and Quad processors.&lt;/P&gt;</description>
      <pubDate>Sun, 23 Sep 2007 16:36:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907769#M4499</guid>
      <dc:creator>jarnell</dc:creator>
      <dc:date>2007-09-23T16:36:38Z</dc:date>
    </item>
    <item>
      <title>Re: SetThreadIdealProcessor() or SetThreadAffinityMask() or bot</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907770#M4500</link>
      <description>Yes, for the quad core, there could be an advantage in affinitizing threads which most frequently share groups of cache lines to the cores which share an L2 cache. The normal BIOS numbering is likely to alternate, as you mentioned, to help spread work across caches when not all cores are in use.&lt;BR /&gt;If your application activates strided hardware prefetch, optimized placement of threads which use neighboring cache lines should help avoid extra memory traffic due to prefetch. Cache lines which are prefetched but not used by one thread shouldn't increase bus congestion if they are used by the other thread. &lt;BR /&gt;Likewise, if your threads access memory in too scattered a fashion, adjacent sector prefetch may prove disadvantageous.&lt;BR /&gt;</description>
      <pubDate>Sun, 23 Sep 2007 19:11:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907770#M4500</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2007-09-23T19:11:25Z</dc:date>
    </item>
    <item>
      <title>Re: SetThreadIdealProcessor() or SetThreadAffinityMask() or bot</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907771#M4501</link>
      <description>&lt;P&gt;&lt;STRONG&gt;GetLogicalProcessorInformation&lt;/STRONG&gt; - this doesn't seem to be available in Windows XP (32 bit)&lt;/P&gt;
&lt;P&gt;From msdn @ microsoft:&lt;/P&gt;
&lt;H4&gt;Requirements&lt;/H4&gt;
&lt;P&gt;
&lt;TABLE class="psdkRequirements"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TH&gt;
&lt;P&gt;Client&lt;/P&gt;&lt;/TH&gt;
&lt;TD&gt;
&lt;P&gt;Requires Windows Vista or Windows XP Professional x64 Edition.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TH&gt;
&lt;P&gt;Server&lt;/P&gt;&lt;/TH&gt;
&lt;TD&gt;
&lt;P&gt;Requires Windows Server 2008 or Windows Server 2003.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TH&gt;
&lt;P&gt;Header&lt;/P&gt;&lt;/TH&gt;
&lt;TD&gt;
&lt;P&gt;Declared in Winbase.h; include Windows.h.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TH&gt;
&lt;P&gt;Library&lt;/P&gt;&lt;/TH&gt;
&lt;TD&gt;
&lt;P&gt;Use Kernel32.lib.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TH&gt;
&lt;P&gt;DLL&lt;/P&gt;&lt;/TH&gt;
&lt;TD&gt;
&lt;P&gt;Requires Kernel32.dll.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/P&gt;
&lt;P&gt;...a bit of a nuisance.&lt;/P&gt;
&lt;P&gt;GetNumaProcessorNode( ) seems to work, though.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Sep 2007 09:34:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907771#M4501</guid>
      <dc:creator>jarnell</dc:creator>
      <dc:date>2007-09-24T09:34:42Z</dc:date>
    </item>
    <item>
      <title>Re: SetThreadIdealProcessor() or SetThreadAffinityMask() or bot</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907772#M4502</link>
      <description>&lt;P&gt;Jarnell,&lt;/P&gt;
&lt;P&gt;Just curious here. On your Q6600, on WinXP 32-bit, when you run a loop calling GetNumaProcessorNode( ) with all 4 processor numbers (0 through 3), what node numbers are reported? Of interest to me is whether you see 1 node, 2 nodes, or 3 nodes.&lt;/P&gt;
&lt;P&gt;If the underlying NUMA support is not present on WinXP 32-bit (as implied by the documentation), then you would expect all processors to be reported as being on the same node. Not having that configuration here, I can only speculate.&lt;/P&gt;
&lt;P&gt;You should note that the bit position of a physical processor core in the affinity mask is not guaranteed to be the same across all operating systems or revisions thereof, so use of system function calls is recommended.&lt;/P&gt;
&lt;P&gt;However, if the system function calls do not offer this information (e.g. WinXP Home), then you could write a small startup function that probes memory and makes the associations.&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Sep 2007 12:57:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907772#M4502</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-09-24T12:57:31Z</dc:date>
    </item>
    <item>
      <title>Re: SetThreadIdealProcessor() or SetThreadAffinityMask() or bot</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907773#M4503</link>
      <description>&lt;PRE&gt;for (int n = 0; n &amp;lt; 4; n++) {
    UCHAR NodeNum;
    ULONGLONG Mask;
    if (GetNumaProcessorNode( (UCHAR) n, &amp;amp;NodeNum )) {
        // Ask for the mask of the node this processor belongs to
        GetNumaNodeProcessorMask( NodeNum, &amp;amp;Mask );
        // Mask is 64 bits wide, so print it with %llx
        printf( " Proc %i NodeNum %i Mask %llx\n",
                n, (int) NodeNum, Mask );
    }
}&lt;/PRE&gt;
&lt;P&gt;On XP 32-bit, NodeNum is set to zero and Mask is set to 0x0F (so I do have 4 processors!).&lt;/P&gt;
&lt;P&gt;I suspect this is not exactly correct.&lt;/P&gt;
&lt;P&gt;Incidentally, I had some code developed for my 2-core E6600 which does a lot of audio processing in buffers using two threads.&lt;/P&gt;
&lt;P&gt;For a test I just pushed the number of threads up to 4 and ran it on the Q6600.&lt;/P&gt;
&lt;P&gt;The E6600 took about 7.5 seconds, the Q6600 about 5 seconds.&lt;/P&gt;
&lt;P&gt;Not a bad result without any optimisation at all.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Sep 2007 09:03:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907773#M4503</guid>
      <dc:creator>jarnell</dc:creator>
      <dc:date>2007-09-25T09:03:50Z</dc:date>
    </item>
    <item>
      <title>Re: SetThreadIdealProcessor() or SetThreadAffinityMask() or bot</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907774#M4504</link>
      <description>&lt;P&gt;What Mask=0x0F tells you is that all 4 processors reside on NUMA node 0, and that the line of demarcation for NUMA is at the memory bus interface (the Q6600 package has only one memory bus interface). Therefore GetNumaProcessorNode is insufficient for your purpose of determining cache groupings for processors.&lt;/P&gt;
&lt;P&gt;You might want to consider adding the startup test I suggested earlier. Failing that, read the grouping from an environment variable and fall back to a hard-wired default when no value is specified.&lt;/P&gt;
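&lt;P&gt;A rough sketch of the environment variable fallback. The variable name CACHE_GROUPS, its comma-separated mask format, and the default pairing of 0/2 and 1/3 are all assumptions made for illustration, not anything the OS or the processor documentation defines.&lt;/P&gt;
<![CDATA[
```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

// Parse a list such as "0x5,0xA" into one affinity mask per cache group.
// Returns an empty vector if any entry is empty, malformed, or zero.
inline std::vector<std::uint64_t> parse_cache_groups(const std::string& text) {
    std::vector<std::uint64_t> groups;
    std::size_t pos = 0;
    while (pos <= text.size()) {
        std::size_t comma = text.find(',', pos);
        if (comma == std::string::npos) comma = text.size();
        const std::string item = text.substr(pos, comma - pos);
        char* end = nullptr;
        // Base 0 accepts decimal, octal and 0x-prefixed hexadecimal.
        const unsigned long long value = std::strtoull(item.c_str(), &end, 0);
        if (item.empty() || *end != '\0' || value == 0) return {};
        groups.push_back(value);
        pos = comma + 1;
    }
    return groups;
}

// Look up the hypothetical CACHE_GROUPS variable, else use the default.
inline std::vector<std::uint64_t> cache_groups_from_env() {
    std::vector<std::uint64_t> groups;
    if (const char* env = std::getenv("CACHE_GROUPS")) {
        groups = parse_cache_groups(env);
    }
    // Hard-wired default (an assumption): processors 0/2 and 1/3 are paired.
    if (groups.empty()) groups = {0x5, 0xA};
    return groups;
}
```
]]>
&lt;P&gt;Setting CACHE_GROUPS=0x3,0xC before launching would switch the program to a 0/1 and 2/3 pairing without a rebuild.&lt;/P&gt;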
&lt;P&gt;There was supposed to be something in the OpenMP spec regarding a generic way to make this determination.&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Sep 2007 13:03:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907774#M4504</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-09-25T13:03:39Z</dc:date>
    </item>
    <item>
      <title>Re: SetThreadIdealProcessor() or SetThreadAffinityMask() or bot</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907775#M4505</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;DIV&gt;&lt;STRONG&gt;JimDempseyAtTheCove:&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;
&lt;P&gt;The intel site has a white paper on how to determine cache associations. It is not good practice to assume cache sharing is always 0/2, 1/3 or 0/1, 2/3 as processor designs do change. The better practice is to write or obtain code that makes this determination at program startup time.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/DIV&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;What is the link to the white paper on determining cache associations? The closest I've come is:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://community.intel.com/en-us/articles/article_nice_name"&gt;http://software.intel.com/en-us/articles/detecting-multi-core-processor-topology-in-an-ia-32-platform&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;which describes how to detect processor topology, and gives some info on how cores share any given cache, and thus how many caches are in a given package, but I cannot find any CPUID information on which cores in a package share a cache. &lt;/P&gt;
&lt;P&gt;On the E5440 it seems to be 0/1 and 2/3.&lt;/P&gt;
&lt;P&gt;A previous poster told me a new white paper on cache topology would be coming out in June....&lt;/P&gt;</description>
      <pubDate>Fri, 04 Apr 2008 15:23:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/SetThreadIdealProcessor-or-SetThreadAffinityMask-or-both/m-p/907775#M4505</guid>
      <dc:creator>ravenous_wolves</dc:creator>
      <dc:date>2008-04-04T15:23:48Z</dc:date>
    </item>
  </channel>
</rss>

