SetThreadIdealProcessor() or SetThreadAffinityMask() or both?

jarnell · ‎09-23-2007

I am writing an application to run on a Quad Core Q6600 on Windows XP.

I have four threads which I would like to force to run on one core each - should I use SetThreadIdealProcessor() or SetThreadAffinityMask() or both?

With SetThreadIdealProcessor( ) does the parameter dwIdealProcessor start at zero?

Also, on the Q6600, is it correct that cores 0 and 2 share the same L2 cache, and cores 1 and 3 share a separate cache?

Does this numbering correspond to the numbering used in the Windows XP functions?

Sorry for all these questions!

jimdempseyatthecove · ‎09-23-2007

Jarnell,

SetThreadIdealProcessor takes a zero-based processor number.
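For the "one thread per core" part of the question, a minimal sketch follows. It assumes each worker thread pins itself at startup and that the processor number passed in actually exists; `cpu_mask` and `pin_current_thread` are illustrative names, not Windows API functions. SetThreadAffinityMask is the hard restriction, while SetThreadIdealProcessor is only a scheduling hint, so the mask is what forces the placement. The Windows calls are guarded so the helper also compiles on other platforms.

```c
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Bit n of an affinity mask stands for zero-based logical processor n. */
size_t cpu_mask(unsigned cpu)
{
    return (size_t)1 << cpu;
}

/* Pin the calling thread to one logical processor.
   Returns 1 on success, 0 on failure (always 0 on non-Windows builds). */
int pin_current_thread(unsigned cpu)
{
#ifdef _WIN32
    HANDLE self = GetCurrentThread();
    /* Hard restriction: the thread may only run on this processor. */
    if (SetThreadAffinityMask(self, (DWORD_PTR)cpu_mask(cpu)) == 0)
        return 0;
    /* Optional hint; adds little once the mask has a single bit set. */
    SetThreadIdealProcessor(self, (DWORD)cpu);
    return 1;
#else
    (void)cpu;
    return 0;
#endif
}
```

Each of the four threads would call something like pin_current_thread(0) through pin_current_thread(3) as its first action.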

The Intel site has a white paper on how to determine cache associations. It is not good practice to assume cache sharing is always 0/2, 1/3 or 0/1, 2/3, as processor designs do change. The better practice is to write or obtain code that makes this determination at program startup time.

See the MSDN articles regarding GetLogicalProcessorInformation.

From MSDN:

GetLogicalProcessorInformation can be used to get information about the relationship between logical processors in the system, including:

The logical processors that are part of a NUMA node.
The logical processors that share resources. An example of this type of resource sharing would be hyperthreading scenarios.

Your application can use this information when affinitizing your threads and processes to take best advantage of the hardware properties of the platform, or to determine the number of logical and physical processors for licensing purposes.

Each of the SYSTEM_LOGICAL_PROCESSOR_INFORMATION structures returned in the buffer contains the following:

A logical processor affinity mask, which indicates the logical processors that the information in the structure applies to.
A logical processor mask of type LOGICAL_PROCESSOR_RELATIONSHIP, which indicates the relationship between the logical processors in the mask. Applications calling this function must be prepared to handle additional indicator values in the future.

Note that the order in which the structures are returned in the buffer may change between calls to this function.

The size of the SYSTEM_LOGICAL_PROCESSOR_INFORMATION structure varies between processor architectures and versions of Windows. For this reason, applications should first call this function to obtain the required buffer size, then dynamically allocate memory for the buffer.
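The two-call pattern described above can be sketched roughly as follows. This assumes a Windows version that exports GetLogicalProcessorInformation; `count_cpus` and `print_l2_groups` are illustrative names only. It lists the affinity mask of each L2 cache the OS reports, which is the information needed to decide which processors share a cache.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Count the logical processors named in an affinity mask. */
int count_cpus(unsigned long long mask)
{
    int n = 0;
    while (mask) { n += (int)(mask & 1u); mask >>= 1; }
    return n;
}

/* Print the affinity mask of every L2 cache the OS reports.
   Returns the number of L2 caches found, or -1 if the query fails. */
int print_l2_groups(void)
{
#ifdef _WIN32
    DWORD len = 0, i, count;
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info;
    int found = 0;

    /* First call is expected to fail and report the required size. */
    if (GetLogicalProcessorInformation(NULL, &len) ||
        GetLastError() != ERROR_INSUFFICIENT_BUFFER)
        return -1;
    info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION *)malloc(len);
    if (info == NULL)
        return -1;
    if (!GetLogicalProcessorInformation(info, &len)) {
        free(info);
        return -1;
    }

    count = len / sizeof(*info);
    for (i = 0; i < count; i++) {
        if (info[i].Relationship == RelationCache && info[i].Cache.Level == 2) {
            unsigned long long m = (unsigned long long)info[i].ProcessorMask;
            printf("L2 shared by %d processors, mask 0x%llx\n", count_cpus(m), m);
            found++;
        }
    }
    free(info);
    return found;
#else
    return -1; /* the API only exists on Windows */
#endif
}
```

The order of the returned structures is not stable between calls, so the sketch filters by relationship type rather than relying on position.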

Jim Dempsey

jarnell · ‎09-23-2007

Thanks Jim, I shall have a look at this.

I assume "hyperthreading" refers to the older P4 architecture, not the Q6600 Quad Core architecture.

Also I would assume that cache-sharing is of great significance for something like the Q6600, where there are two independent L2 caches - presumably I need to avoid having these fight over the same chunks of memory.

I also have a Core 2 Duo E6600 machine. At the moment I'm developing an application which I'm hoping to optimise for the Core 2 Duo and Quad processors.

TimP · ‎09-23-2007

Yes, for the quad core, there could be an advantage in affinitizing threads which most frequently share groups of cache lines to the cores which share L2 cache. The normal BIOS numbering is likely to alternate, as you mentioned, to help spread work across caches when not all cores are in use.
If your application activates strided hardware prefetch, optimized placement of threads which use neighboring cache lines should help avoid extra memory traffic due to prefetch. Those cache lines which are prefetched but not used by one thread shouldn't increase bus congestion if they are used by the other thread.
Likewise, if your threads access memory in too scattered a fashion, adjacent sector prefetch may prove disadvantageous.
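As a rough illustration of that placement idea, the sketch below maps worker threads to processors so that threads 2g and 2g+1 (assumed here to be the pair that shares data most) land on the two cores of cache group g. The group masks are inputs that have to come from somewhere else, and both function names are hypothetical.

```c
/* Return the zero-based index of the k-th set bit in mask (k counts from 0),
   or -1 if the mask has fewer than k+1 bits set. */
int nth_cpu_in_mask(unsigned long mask, int k)
{
    int cpu;
    for (cpu = 0; mask != 0; cpu++, mask >>= 1) {
        if ((mask & 1ul) && k-- == 0)
            return cpu;
    }
    return -1;
}

/* Map worker thread t to a processor so that threads 2g and 2g+1
   run on the two cores of cache group g. Returns -1 if t is out of range. */
int cpu_for_thread(const unsigned long *groups, int ngroups, int t)
{
    int g = t / 2;
    if (t < 0 || g >= ngroups)
        return -1;
    return nth_cpu_in_mask(groups[g], t % 2);
}
```

With group masks of 0x5 and 0xA (the alternating numbering mentioned above, taken purely as an example), threads 0 and 1 would go to processors 0 and 2, and threads 2 and 3 to processors 1 and 3.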

jarnell · ‎09-24-2007

GetLogicalProcessorInformation - this doesn't seem to be available in Windows XP (32 bit).

From msdn @ microsoft:

Requirements

Client	Requires Windows Vista or Windows XP Professional x64 Edition.
Server	Requires Windows Server 2008 or Windows Server 2003.
Header	Declared in Winbase.h; include Windows.h.
Library	Use Kernel32.lib.
DLL	Requires Kernel32.dll.

...bit of a nuisance.

GetNumaProcessorNode( ) seems to work though.

jimdempseyatthecove · ‎09-24-2007

Jarnell,

Just curious here. On your Q6600, on WinXP 32-bit, when you run a loop calling GetNumaProcessorNode( ) using all 4 processor numbers (0:3), what are the reported node numbers? Of interest to me is whether you see 1 node, 2 nodes, or 3 nodes.

If the underlying NUMA support is not present on WinXP 32-bit (as implied by the documentation) then you would expect all processors to be reported as being on the same node. Not having that configuration here I can only speculate.

You should note that the physical processor core bit position in the affinity mask is not guaranteed to be in the same position across all operating systems or revisions thereof. So use of system function calls is recommended.

However, if the system function calls do not offer this information (e.g. WinXP Home) then you could write a small startup function that probes memory and makes the associations.

Jim Dempsey

jarnell · ‎09-25-2007

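for (int n = 0; n < 4; n++) {
    UCHAR NodeNum;
    ULONGLONG Mask;
    /* Look up the NUMA node of processor n, then that node's processor mask. */
    if (GetNumaProcessorNode((UCHAR)n, &NodeNum) &&
        GetNumaNodeProcessorMask(NodeNum, &Mask)) {
        /* %I64x is the MSVC format for a 64-bit value. */
        printf(" Proc %i NodeNum %i Mask %I64x ", n, (int)NodeNum, Mask);
    }
}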

On XP 32: NodeNum is set to zero --- Mask is set to 0x0F ( so I do have 4 processors!)

I suspect this is not exactly correct.

Incidentally, I had some code developed for my 2 Core E6600 which does a lot of audio processing in buffers using two threads.

For a test I just pushed up the number of threads to 4 and ran it on the Q6600

The E6600 took about 7.5 seconds, the Q6600 about 5 seconds.

Not a bad result without any optimisation at all.

jimdempseyatthecove · ‎09-25-2007

What Mask=0x0F tells you is that all 4 processors reside on NUMA node 0, and that the line of demarcation for NUMA is at the memory bus interface (the Q6600 package has only one memory bus interface). Therefore GetNumaProcessorNode is insufficient for your purpose of determining cache groupings for processors.

You might want to consider adding the startup test I suggested earlier. Failing that, read an environment variable to get the cache grouping, and fall back to a hard-wired default if no value is specified.
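The environment-variable fallback could look something like the sketch below. The variable name CACHE_GROUPS and its format (comma-separated hex affinity masks, one per shared cache) are made up for illustration, as are the function names. The default is the 0/2, 1/3 pairing asked about earlier in the thread, which is an unverified assumption and not something the code can confirm.

```c
#include <stdlib.h>

/* Parse a comma-separated list of hex affinity masks such as "0x5,0xA".
   Stores up to max masks in out and returns how many were read;
   returns 0 if the text is NULL, empty, or malformed. */
int parse_group_masks(const char *text, unsigned long *out, int max)
{
    int n = 0;
    if (text == NULL)
        return 0;
    while (*text != '\0' && n < max) {
        char *end;
        unsigned long m = strtoul(text, &end, 16);
        if (end == text || m == 0)
            return 0;               /* no digits, or an empty group */
        out[n++] = m;
        if (*end == ',')
            end++;
        else if (*end != '\0')
            return 0;               /* unexpected character */
        text = end;
    }
    return n;
}

/* Read the hypothetical CACHE_GROUPS variable; if it is unset or invalid,
   fall back to an ASSUMED default of 0/2 and 1/3 sharing a cache. */
int load_cache_groups(unsigned long *out, int max)
{
    int n = parse_group_masks(getenv("CACHE_GROUPS"), out, max);
    if (n == 0 && max >= 2) {
        out[0] = 0x5;   /* processors 0 and 2 */
        out[1] = 0xA;   /* processors 1 and 3 */
        n = 2;
    }
    return n;
}
```

Setting CACHE_GROUPS=0x3,0xC before launching would then describe a machine where 0/1 and 2/3 share a cache, without rebuilding the program.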

There was supposed to be something in the OpenMP spec regarding a generic way to make this determination.

Jim Dempsey

ravenous_wolves · ‎04-04-2008

JimDempseyAtTheCove:

The Intel site has a white paper on how to determine cache associations. It is not good practice to assume cache sharing is always 0/2, 1/3 or 0/1, 2/3, as processor designs do change. The better practice is to write or obtain code that makes this determination at program startup time.

What is the link to the white paper on determining cache associations? The closest I've come is:

http://software.intel.com/en-us/articles/detecting-multi-core-processor-topology-in-an-ia-32-platform

which describes how to detect processor topology, and gives some info on how cores share any given cache, and thus how many caches are in a given package, but I cannot find any CPUID information on which cores in a package share a cache.

On the E5440 it seems to be 0/1 and 2/3.

A previous poster told me a new white paper on cache topology would be coming out in June....