Detecting which cores share cache on Quad-core E5440

ravenous_wolves · ‎04-01-2008

I'm working with a dual-processor system where each processor is (currently) an E5440 quad-core processor. Each processor has 2x6 = 12MB cache. I want to assign some number of threads across the 8 cores in a way that minimizes cache contention.

I've been referencing the following articles,
http://software.intel.com/en-us/articles/detecting-multi-core-processor-topology-in-an-ia-32-platform
http://www.devx.com/go-parallel/Article/27398

The first article appears to say the same technique can be used to determine the # of cores per cache. The second article attempts that extension, but the code doesn't compile under Visual Studio 2005.

However, even once I know the # of cores per cache, I still don't know which cores within each processor share each cache. I was very surprised to see that the Windows logical processors do not map onto physical processors in the way I had expected (0-3 = processor 1, 4-7 = processor 2). On my system, I receive the following output:

Relationships between OS affinity mask, Initial APIC ID, and 3-level sub-IDs:

AffinityMask = 1; Initial APIC = 0; Physical ID = 0, Core ID = 0, SMT ID = 0
AffinityMask = 2; Initial APIC = 4; Physical ID = 4, Core ID = 0, SMT ID = 0
AffinityMask = 4; Initial APIC = 1; Physical ID = 0, Core ID = 1, SMT ID = 0
AffinityMask = 8; Initial APIC = 2; Physical ID = 0, Core ID = 2, SMT ID = 0
AffinityMask = 16; Initial APIC = 3; Physical ID = 0, Core ID = 3, SMT ID = 0
AffinityMask = 32; Initial APIC = 5; Physical ID = 4, Core ID = 1, SMT ID = 0
AffinityMask = 64; Initial APIC = 6; Physical ID = 4, Core ID = 2, SMT ID = 0
AffinityMask = 128; Initial APIC = 7; Physical ID = 4, Core ID = 3, SMT ID = 0

If I was running 4 threads, my previous algorithm would have assigned these to logical indices: 0, 2, 4, 6. Three of these would be running on physical processor #1, whereas I had expected to only have 2 running on each physical processor.
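
The count above can be checked mechanically. The following is a minimal sketch that hard-codes the table from this one machine; the names kApicIdByLogicalIndex, package_of_logical and count_on_package are made up for illustration, and the package number is taken as the APIC ID shifted right by two bits (so it comes out as 0 and 1 rather than the 0 and 4 printed above), which assumes 4 cores per package and no SMT.

```c
#include <assert.h>
#include <stddef.h>

/* Initial APIC ID for each OS logical processor index, copied from the
   output above (index N corresponds to affinity mask 1 << N).
   This table is specific to the one machine in the post. */
static const unsigned char kApicIdByLogicalIndex[8] = { 0, 4, 1, 2, 3, 5, 6, 7 };

/* Assumption: 4 cores per package and no SMT, so the two low APIC ID bits
   select the core and the remaining bits select the package. */
static unsigned package_of_logical(unsigned logical_index)
{
    assert(logical_index < 8);
    return kApicIdByLogicalIndex[logical_index] >> 2;
}

/* Count how many of the given logical indices land on `package`. */
static unsigned count_on_package(const unsigned *logical, size_t n, unsigned package)
{
    unsigned count = 0;
    size_t i;
    for (i = 0; i < n; ++i) {
        if (package_of_logical(logical[i]) == package)
            ++count;
    }
    return count;
}
```

Calling count_on_package with the indices 0, 2, 4, 6 reports three threads on the first package and one on the second, which is the imbalance described above.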

So, what assumptions can I make about caches? If I'm able to either modify the Intel code, or get the DevX code to work, and I know the # of cores per L2 cache, can I assume one cache module works with cores 0,1 and the other 2,3; or, does one module work with cores 0,2 and the other 1,3? Or, is it possible that one module works with 0,3 and the other 1,2?

I don't have access to the GetLogicalProcessorInformation function right now, so I'm doing all this through CPUID inline assembler.

Thanks,

levicki · ‎04-01-2008

The main problem with this is that you cannot expect the relationship you are seeing to stay the same on another computer because, if I remember correctly, the OS assignment varies depending on the BIOS APIC table and the OS scheduler logic.

Perhaps the best way would be to use Initial APIC ID because successive numbers seem to be representing adjacent cores for each physical package. If I understand your numbers correctly you would want to assign threads to the cores with APIC ID 0, 2, 4 and 6.

The number of cores sharing each cache can be found by enumerating the deterministic cache parameters leaf (CPUID instruction with EAX=4). You can find more information about cache sharing among cores and threads on page 33, section 3.1.5.1, of AP-485, Intel Processor Identification and the CPUID Instruction, document order #241618.
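
As a rough illustration of what that leaf returns, here is a sketch that only decodes the EAX value; it does not execute CPUID itself, so it is independent of compiler intrinsics or inline assembler. The struct and function names are hypothetical. The bit positions are the documented ones for leaf 4: cache type in bits 4:0, cache level in bits 7:5, and (maximum number of addressable logical processor IDs sharing the cache) minus one in bits 25:14.

```c
#include <assert.h>
#include <stdint.h>

/* Selected fields of EAX as returned by CPUID with EAX=4, ECX=subleaf. */
typedef struct {
    unsigned type;            /* bits 4:0: 0 = no more caches, 1 = data,
                                 2 = instruction, 3 = unified */
    unsigned level;           /* bits 7:5: cache level (1, 2, 3, ...) */
    unsigned max_sharing_ids; /* bits 25:14 plus one: maximum number of
                                 addressable logical processor IDs that
                                 share this cache */
} cache_leaf_eax;

static cache_leaf_eax decode_cpuid4_eax(uint32_t eax)
{
    cache_leaf_eax f;
    f.type = eax & 0x1Fu;
    f.level = (eax >> 5) & 0x7u;
    f.max_sharing_ids = ((eax >> 14) & 0xFFFu) + 1u;
    assert(f.max_sharing_ids >= 1u);
    return f;
}
```

In use, one would execute CPUID with EAX=4 and ECX = 0, 1, 2, ... and decode each EAX until the type field comes back as 0. Note that the sharing field is a maximum count of addressable IDs, not necessarily the number of processors actually present.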

ravenous_wolves · ‎04-02-2008

I understand that I can't rely on the mapping I quoted above. I'm looking at writing logic to query the processor(s) to obtain this information and then perform thread assignment based on the results.

Given two physical packages, each with four cores and two caches, my goal is to assign four threads such that each of them resides on its own cache.

CPUID.4 tells me that there are 2 cores per cache, but I have not found any documentation which says definitively that the first two cores in a physical package use the first cache in that physical package.

From the data above, which cores use the first cache in the first physical package (ID = 0)?
- APICs 0,1? In my example data above, this would be LP 0,2.
- APICs 0,2? In my example data above, this would be LP 0,3
- APICs 0,3? In my example data above, this would be LP 0,4
- Something else?

Is this known a priori, or is there some method of dynamically determining it?

Thanks,

ravenous_wolves · ‎04-02-2008

I did some measurements on this based on the following assumption:

[ (0, 2), (3, 4) ][ (1, 5), (6, 7) ]

Where the brackets [] represent physical processors, the parentheses represent the two cache elements within each physical processor, and the numbers represent the logical processors the OS understands, such that 1 << N is the processor affinity mask for a given logical processor index N.

I ran a highly cache-dependent operation a few thousand times, concurrently, on all possible pairs of logical processors, i.e. (0, 2), (0, 3), (0, 4), ..., (1, 2), (1, 3), (1, 4), ..., (2, 3), (2, 4), ..., (6, 7).

I get a performance measurement of ~2 ms for all combinations residing on separate caches. Being on a different physical processor does not provide any additional benefit. I get a measurement of ~4.5 ms for each of the four pairs sharing a cache element.

Mysteriously (?), I get ~3.5 ms for each pair which includes LP 0. The pair which shares cache with core 0 actually returns an elevated value of 4.8 ms. I presume this is because some portion of the OS resides permanently on the first core.

So, on my current processors (E5440) the first two APICs within a physical package use the first cache element in that physical processor. The question, however, still stands as to whether this is behavior I can depend on or not.
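
To make the thread assignment concrete, here is a small sketch that picks one logical processor per cache element from the layout measured above, [ (0, 2), (3, 4) ][ (1, 5), (6, 7) ]. The table and the function name one_mask_per_cache are invented for this example, and the table is only valid for this particular machine; a real program would fill it in at run time.

```c
#include <assert.h>

#define NUM_LOGICAL 8

/* Cache element (0..3) for each logical processor index, taken from the
   measured layout [ (0, 2), (3, 4) ][ (1, 5), (6, 7) ].
   Machine-specific: hard-coded here only for illustration. */
static const int kCacheByLogicalIndex[NUM_LOGICAL] = { 0, 2, 0, 1, 1, 2, 3, 3 };

/* Writes one affinity mask (1 << logical index) per distinct cache element,
   choosing the lowest logical index found for each cache.
   Returns the number of masks written, at most max_masks. */
static int one_mask_per_cache(unsigned long *masks, int max_masks)
{
    unsigned seen = 0; /* bit c is set once cache c has been given a thread */
    int count = 0;
    int lp;

    assert(masks != 0 && max_masks >= 0);
    for (lp = 0; lp < NUM_LOGICAL && count < max_masks; ++lp) {
        unsigned bit = 1u << kCacheByLogicalIndex[lp];
        if (seen & bit)
            continue; /* this cache already has a thread */
        seen |= bit;
        masks[count++] = 1ul << lp;
    }
    return count;
}
```

For four threads this yields the masks 1, 2, 8 and 64, i.e. logical processors 0, 1, 3 and 6, each on a different cache element. On Windows each mask could then be handed to SetThreadAffinityMask. Given the slowdown seen on LP 0, one might prefer the other member of that pair (LP 2), which this sketch does not attempt.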

Cheers,

SHIH_K_Intel · ‎04-02-2008

For your information, we're working on an update of the white paper on processor topology enumeration and the associated reference code. I expect them to be ready in the June time frame.

The update is expected to include enhancement in several areas:

1. System topology enumeration using x2APIC ID where available. Enumeration using initial APIC ID will also be supported when x2APIC ID is not available.

2. Reference code for cache topology enumeration will also be included along with CPU topology.

The cache topology enumeration is based on those published in the Intel 64 Architecture Software Optimization Manual.

ravenous_wolves · ‎04-07-2008

Apparently there is a CacheIndex encoded into the APIC_ID. There is pseudo-code showing how to extract it in section 7.10.3 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide.

The logic I ended up using looks like this:

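// Number of low APIC ID bits that distinguish logical processors within one L2 cache
int nL2CacheIDMaskWidth = find_maskwidth(nLogicalProcessorsPerL2Cache_supported);
// Unsigned, so the mask is not sign-extended when promoted to int
unsigned char nL2CacheIDMask = (unsigned char) (0xFF << nL2CacheIDMaskWidth);
// The remaining high bits of the APIC ID identify the L2 cache
int nL2CacheIndex = ((nAPIC_ID & nL2CacheIDMask) >> nL2CacheIDMaskWidth);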
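
For anyone who wants to try this without the Intel reference code, below is a self-contained sketch of the same calculation. mask_width is a portable stand-in I am assuming for find_maskwidth (the smallest bit width that can number the given count of items), not the reference implementation, and l2_cache_index is a made-up wrapper name. APIC IDs are treated as 8-bit initial APIC IDs.

```c
#include <assert.h>

/* Assumed stand-in for find_maskwidth: the smallest width such that
   (1 << width) >= count, i.e. the bits needed to number `count` items. */
static unsigned mask_width(unsigned count)
{
    unsigned width = 0;
    assert(count >= 1 && count <= 256); /* 8-bit initial APIC ID */
    while ((1u << width) < count)
        ++width;
    return width;
}

/* L2 cache index: the APIC ID with the bits that select a logical
   processor inside one L2 cache masked off and shifted out. */
static unsigned l2_cache_index(unsigned apic_id, unsigned logical_per_l2)
{
    unsigned width = mask_width(logical_per_l2);
    unsigned mask = (0xFFu << width) & 0xFFu;
    return (apic_id & mask) >> width;
}
```

With 2 logical processors per L2 cache, APIC IDs 0 through 7 map to cache indexes 0, 0, 1, 1, 2, 2, 3, 3. Translated back to logical processors through the table in the first post, that agrees with the measured layout [ (0, 2), (3, 4) ][ (1, 5), (6, 7) ]. Because the package bits sit above the core bits in the APIC ID, the index is distinct across both packages.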