I am writing an application to run on a quad-core Q6600 on Windows XP.
I have four threads which I would like to force to run on one core each - should I use SetThreadIdealProcessor() or SetThreadAffinityMask() or both?
With SetThreadIdealProcessor( ) does the parameter dwIdealProcessor start at zero?
Also, on the Q6600, is it correct that cores 0 and 2 share the same L2 cache, and cores 1 and 3 share a separate cache?
Does this numbering correspond to the numbering used in the Windows XP functions?
Sorry for all these questions!
Jarnell,
SetThreadIdealProcessor takes a zero-based processor number.
The Intel site has a white paper on how to determine cache associations. It is not good practice to assume cache sharing is always 0/2, 1/3 or 0/1, 2/3, as processor designs do change. The better practice is to write or obtain code that makes this determination at program startup time.
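As a rough illustration of the two calls asked about, here is a minimal sketch that pins a thread to one logical processor. The helper names `affinity_mask_for` and `pin_thread_to_cpu` are made up for this example. It assumes fewer than 64 logical processors, and that processor numbers map to mask bits in the usual way (bit n for processor n). SetThreadAffinityMask is the hard restriction; SetThreadIdealProcessor is only a scheduling hint, so calling it after setting a single-bit mask is probably redundant.

```c
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Bit n of an affinity mask selects logical processor n (zero-based).
   Hypothetical helper; assumes cpu < 64. */
unsigned long long affinity_mask_for(unsigned cpu)
{
    return 1ULL << cpu;
}

#ifdef _WIN32
/* Pin an existing thread to one logical processor.
   Hypothetical helper; returns 1 on success, 0 on failure. */
int pin_thread_to_cpu(HANDLE thread, unsigned cpu)
{
    DWORD_PTR mask = (DWORD_PTR)affinity_mask_for(cpu);

    /* Hard restriction: the thread may only run on processors in the mask.
       The call returns the previous mask, or 0 on failure. */
    if (SetThreadAffinityMask(thread, mask) == 0)
        return 0;

    /* Scheduling hint only. It returns the previous ideal processor,
       or (DWORD)-1 on failure. */
    if (SetThreadIdealProcessor(thread, (DWORD)cpu) == (DWORD)-1)
        return 0;

    return 1;
}
#endif
```

A worker thread could call `pin_thread_to_cpu(GetCurrentThread(), 2)` as its first action to stay on processor 2.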
See MSDN articles regarding GetLogicalProcessorInformation
From MSDN:
"GetLogicalProcessorInformation can be used to get information about the relationship between logical processors in the system, including:
- The logical processors that are part of a NUMA node.
- The logical processors that share resources. An example of this type of resource sharing would be hyperthreading scenarios.
Your application can use this information when affinitizing your threads and processes to take best advantage of the hardware properties of the platform, or to determine the number of logical and physical processors for licensing purposes.
Each of the SYSTEM_LOGICAL_PROCESSOR_INFORMATION structures returned in the buffer contains the following:
- A logical processor affinity mask, which indicates the logical processors that the information in the structure applies to.
- A logical processor mask of type LOGICAL_PROCESSOR_RELATIONSHIP, which indicates the relationship between the logical processors in the mask. Applications calling this function must be prepared to handle additional indicator values in the future.
Note that the order in which the structures are returned in the buffer may change between calls to this function.
The size of the SYSTEM_LOGICAL_PROCESSOR_INFORMATION structure varies between processor architectures and versions of Windows. For this reason, applications should first call this function to obtain the required buffer size, then dynamically allocate memory for the buffer."
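The two-call pattern described in the quote (ask for the size, allocate, call again) might look like the sketch below. It lists the processor mask of each L2 cache, which is the cache-sharing information in question. The helper names `count_mask_bits` and `print_l2_sharing` are made up for this example, and it assumes a Windows version and SDK that actually provide GetLogicalProcessorInformation.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Count how many logical processors are set in an affinity mask. */
int count_mask_bits(unsigned long long mask)
{
    int n = 0;
    while (mask) {
        n += (int)(mask & 1ULL);
        mask >>= 1;
    }
    return n;
}

#ifdef _WIN32
/* Print the processor mask of every L2 cache.
   Returns the number of L2 caches found, or -1 on error. */
int print_l2_sharing(void)
{
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *buf;
    DWORD len = 0, count, i;
    int caches = 0;

    /* First call with no buffer: expected to fail and report the size. */
    if (GetLogicalProcessorInformation(NULL, &len) ||
        GetLastError() != ERROR_INSUFFICIENT_BUFFER)
        return -1;

    buf = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION *)malloc(len);
    if (buf == NULL)
        return -1;

    if (!GetLogicalProcessorInformation(buf, &len)) {
        free(buf);
        return -1;
    }

    /* Entries may arrive in any order; unknown relationships are skipped. */
    count = len / sizeof(*buf);
    for (i = 0; i < count; i++) {
        if (buf[i].Relationship == RelationCache && buf[i].Cache.Level == 2) {
            unsigned long long mask = (unsigned long long)buf[i].ProcessorMask;
            printf("L2 cache %d: mask 0x%llx (%d processors)\n",
                   caches, mask, count_mask_bits(mask));
            caches++;
        }
    }

    free(buf);
    return caches;
}
#endif
```

Processors whose bits appear in the same L2 mask share that cache, so threads that work on the same data could be pinned within one mask.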
Jim Dempsey
Thanks Jim, I shall have a look at this.
I assume "hyperthreading" refers to the older P4 architecture, not the Q6600 Quad Core architecture.
Also I would assume that cache-sharing is of great significance for something like the Q6600, where there are two independent L2 caches - presumably I need to avoid having these fight over the same chunks of memory.
I also have a Core 2 Duo E6600 machine. At the moment I'm developing an application which I'm hoping to optimise for both the Core 2 Duo and Core 2 Quad processors.
If your application activates strided hardware prefetch, optimized placement of threads which use neighboring cache lines should help avoid extra memory traffic due to prefetch. Cache lines which are prefetched but not used by one thread shouldn't increase bus congestion if they are used by the other thread.
Likewise, if your threads access memory in too scattered a fashion, adjacent sector prefetch may prove disadvantageous.
GetLogicalProcessorInformation doesn't seem to be available in Windows XP (32-bit).
From MSDN:
Requirements

| Client | Requires Windows Vista or Windows XP Professional x64 Edition. |
|---|---|
| Server | Requires Windows Server 2008 or Windows Server 2003. |
| Header | Declared in Winbase.h; include Windows.h. |
| Library | Use Kernel32.lib. |
| DLL | Requires Kernel32.dll. |
A bit of a nuisance.
GetNumaProcessorNode() seems to work, though.
Jarnell,
Just curious here. On your Q6600 on WinXP 32-bit, when you run a loop calling GetNumaProcessorNode() for all 4 processor numbers (0 to 3), what node numbers are reported? I am interested in whether you see 1 node, 2 nodes, or 3 nodes.
If the underlying NUMA support is not present on WinXP 32-bit (as implied by the documentation), then you would expect all processors to be reported as being on the same node. Not having that configuration here, I can only speculate.
You should note that the bit position of a physical processor core in the affinity mask is not guaranteed to be the same across all operating systems or revisions thereof. So use of system function calls is recommended.
However, if the system function calls do not offer this information (e.g. WinXP Home), then you could write a small startup function that probes memory and makes the associations.
Jim Dempsey
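    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        for (int n = 0; n < 4; n++) {
            UCHAR NodeNum;
            ULONGLONG Mask;
            /* Look up the NUMA node that holds processor n. */
            if (GetNumaProcessorNode((UCHAR)n, &NodeNum)) {
                /* The mask query takes a node number, not a processor number. */
                GetNumaNodeProcessorMask(NodeNum, &Mask);
                /* Mask is 64-bit, so print it with %llx rather than %x. */
                printf(" Proc %i NodeNum %i Mask %llx ",
                       n, (int)NodeNum, (unsigned long long)Mask);
            }
        }
        return 0;
    }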
On XP 32-bit, NodeNum is set to zero and Mask is set to 0x0F (so I do have 4 processors!).
I suspect this is not exactly correct.
Incidentally, I had some code developed for my 2-core E6600 which does a lot of audio processing in buffers using two threads.
For a test I just pushed up the number of threads to 4 and ran it on the Q6600.
The E6600 took about 7.5 seconds, the Q6600 about 5 seconds.
Not a bad result without any optimisation at all.
What Mask=0x0F tells you is that all 4 processors reside on NUMA node 0, and that the line of demarcation for NUMA is at the memory bus interface for the Q6600 (this package has only one memory bus interface). Therefore GetNumaProcessorNode is insufficient for your purpose of determining cache groupings for processors.
You might want to consider adding the startup test as I suggested earlier. Or, lacking that, use an environment variable and do a lookup to get its value (if no value is specified, use a hard-wired default).
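The environment-variable fallback could be sketched roughly as follows. The variable name `L2_SHARED_MASK` and both helper names are made up for this example. The value is assumed to be a processor mask (decimal or 0x-prefixed hex) naming the processors that share the first L2 cache, and the caller supplies the hard-wired default.

```c
#include <stdlib.h>

/* Parse a processor mask such as "0x5" or "5".
   Returns fallback when the text is missing, empty, or not a clean number. */
unsigned long parse_mask_or_default(const char *text, unsigned long fallback)
{
    char *end;
    unsigned long value;

    if (text == NULL || *text == '\0')
        return fallback;

    /* Base 0 accepts decimal, 0x-prefixed hex, and 0-prefixed octal. */
    value = strtoul(text, &end, 0);
    if (end == text || *end != '\0')
        return fallback;

    return value;
}

/* Look up the hypothetical L2_SHARED_MASK variable, falling back to
   the caller's hard-wired default when it is unset or malformed. */
unsigned long l2_shared_mask(unsigned long fallback)
{
    return parse_mask_or_default(getenv("L2_SHARED_MASK"), fallback);
}
```

For example, `l2_shared_mask(0x3)` would treat processors 0 and 1 as sharing a cache unless the user overrides it, say with `L2_SHARED_MASK=0x5` for processors 0 and 2.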
There was supposed to be something in the OpenMP spec regarding a generic way to make this determination.
Jim Dempsey
JimDempseyAtTheCove: The Intel site has a white paper on how to determine cache associations. It is not good practice to assume cache sharing is always 0/2, 1/3 or 0/1, 2/3 as processor designs do change. The better practice is to write or obtain code that makes this determination at program startup time.
What is the link to the white paper on determining cache associations? The closest I've come is:
which describes how to detect processor topology, and gives some info on how cores share any given cache, and thus how many caches are in a given package, but I cannot find any CPUID information on which cores in a package share a cache.
On the E5440 it seems to be 0/1 and 2/3.
A previous poster told me a new white paper on cache topology would be coming out in June....