Hello,
I have an HP Z800 with dual Xeon X5660 processors: 2 packages x 6 cores x 2 (Hyper-Threading) = 24 logical processors.
In Windows 7 x64, GetSystemInfo returns 24 logical processors, and GetLogicalProcessorInformation agrees (2 processor packages, 12 processor cores, 24 logical processors).
But __cpuid reports 32 logical processors, and I still get 32 even when I disable Hyper-Threading in the BIOS.
CPU-Z reports the right information, and Task Manager shows 24 CPUs.
I am trying to determine which core each logical processor belongs to, so that I can set thread affinity to run only 1 thread per core (no contention between 2 logical processors running on the same core).
I have another machine with dual Xeon E5520 processors (2 x 4 cores x Hyper-Threading = 16 logical processors).
On the Xeon E5520, setting the affinity mask to 1, 2, 4 or 8 makes the thread run on the first 4 cores, while setting it to 16, 32, 64 or 128 makes the thread run on the next 4 cores.
On the Xeon X5660, affinity masks 1 through 32 make the thread run on the 6 cores, while masks 64 through 2048 make the thread run on different logical processors but on the same cores. Performance is much lower in that case.
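One way to avoid hard-coding masks per machine, sketched here under the assumption that you already have one affinity mask per physical core (for example the ProcessorMask of each RelationProcessorCore entry returned by GetLogicalProcessorInformation): keep only the lowest set bit of each core's mask. The function name onePerCore is illustrative.

```cpp
#include <cstdint>
#include <vector>

// Given one affinity mask per physical core (set bits = logical processors
// that share that core), return one single-bit mask per core: the core's
// lowest-numbered logical processor. Pinning one worker thread to each
// returned mask runs exactly one thread per core.
std::vector<std::uint64_t> onePerCore(const std::vector<std::uint64_t>& coreMasks)
{
    std::vector<std::uint64_t> picks;
    for (std::uint64_t m : coreMasks) {
        if (m == 0) continue;            // skip empty entries
        picks.push_back(m & (~m + 1));   // isolate the lowest set bit
    }
    return picks;
}
```

For a core whose siblings sit at bits 0 and 6 (mask 0x41) this picks 0x1; for a core whose siblings are adjacent at bits 2 and 3 (mask 0xC) it picks 0x4. Either way the result does not depend on how the OS happens to number the siblings.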
I compiled a small program that I found on the Microsoft website that shows an example of __cpuid and __cpuidex.
I made sure that I have the latest version of the BIOS and that all the chipset drivers are up to date according to the Intel Update Tool.
Any suggestions are welcome and appreciated.
Here's the output that I got from the tool that I compiled:
For InfoType 0
CPUInfo[0] = 0xb
CPUInfo[1] = 0x756e6547
CPUInfo[2] = 0x6c65746e
CPUInfo[3] = 0x49656e69
For InfoType 1
CPUInfo[0] = 0x206c1
CPUInfo[1] = 0x200800
CPUInfo[2] = 0x29ee3ff
CPUInfo[3] = 0xbfebfbff
For InfoType 2
CPUInfo[0] = 0x55035a01
CPUInfo[1] = 0xf0b2ff
CPUInfo[2] = 0x0
CPUInfo[3] = 0xca0000
For InfoType 3
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0
For InfoType 4
CPUInfo[0] = 0x3c004121
CPUInfo[1] = 0x1c0003f
CPUInfo[2] = 0x3f
CPUInfo[3] = 0x0
For InfoType 5
CPUInfo[0] = 0x40
CPUInfo[1] = 0x40
CPUInfo[2] = 0x3
CPUInfo[3] = 0x1120
For InfoType 6
CPUInfo[0] = 0x7
CPUInfo[1] = 0x2
CPUInfo[2] = 0x9
CPUInfo[3] = 0x0
For InfoType 7
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0
For InfoType 8
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0
For InfoType 9
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0
For InfoType 10
CPUInfo[0] = 0x7300403
CPUInfo[1] = 0x4
CPUInfo[2] = 0x0
CPUInfo[3] = 0x603
For InfoType 11
CPUInfo[0] = 0x1
CPUInfo[1] = 0x2
CPUInfo[2] = 0x100
CPUInfo[3] = 0x0
For InfoType 80000000
CPUInfo[0] = 0x80000008
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0
For InfoType 80000001
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x1
CPUInfo[3] = 0x2c100000
For InfoType 80000002
CPUInfo[0] = 0x65746e49
CPUInfo[1] = 0x2952286c
CPUInfo[2] = 0x6f655820
CPUInfo[3] = 0x2952286e
For InfoType 80000003
CPUInfo[0] = 0x55504320
CPUInfo[1] = 0x20202020
CPUInfo[2] = 0x20202020
CPUInfo[3] = 0x58202020
For InfoType 80000004
CPUInfo[0] = 0x30363635
CPUInfo[1] = 0x20402020
CPUInfo[2] = 0x30382e32
CPUInfo[3] = 0x7a4847
For InfoType 80000005
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0
For InfoType 80000006
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x1006040
CPUInfo[3] = 0x0
For InfoType 80000007
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x100
For InfoType 80000008
CPUInfo[0] = 0x3028
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0
CPU String: GenuineIntel
Stepping ID = 1
Model = 12
Family = 6
Extended model = 2
CLFLUSH cache line size = 64
Logical Processor Count = 32
The following features are supported:
SSE3
MONITOR/MWAIT
CPL Qualified Debug Store
Virtual Machine Extensions
Enhanced Intel SpeedStep Technology
Thermal Monitor 2
Supplemental Streaming SIMD Extensions 3
L1 Context ID
CMPXCHG16B Instruction
xTPR Update Control
Perf\Debug Capability MSR
SSE4.1 Extensions
SSE4.2 Extensions
PPOPCNT Instruction
x87 FPU On Chip
Virtual-8086 Mode Enhancement
Debugging Extensions
Page Size Extensions
Time Stamp Counter
RDMSR and WRMSR Support
Physical Address Extensions
Machine Check Exception
CMPXCHG8B Instruction
APIC On Chip
SYSENTER and SYSEXIT
Memory Type Range Registers
PTE Global Bit
Machine Check Architecture
Conditional Move/Compare Instruction
Page Attribute Table
36-bit Page Size Extension
CFLUSH Extension
Debug Store
Thermal Monitor and Clock Ctrl
MMX Technology
FXSAVE/FXRSTOR
SSE Extensions
SSE2 Extensions
Self Snoop
Multithreading Technology
Thermal Monitor
Pending Break Enable
LAHF/SAHF in 64-bit mode
RDTSCP instruction
64 bit Technology
CPU Brand String: Intel Xeon CPU X5660 @ 2.80GHz
Cache Line Size = 64
L2 Associativity = 6
Cache Size = 256K
Number of Cores = 16
ECX Index 0
Type: Data Cache
Level = 2
Self Initializing
Is Not Fully Associatve
Max Threads = 2
System Line Size = 64
Physical Line Partions = 1
Ways of Associativity = 8
Number of Sets = 64
ECX Index 1
Type: Instruction Cache
Level = 2
Self Initializing
Is Not Fully Associatve
Max Threads = 2
System Line Size = 64
Physical Line Partions = 1
Ways of Associativity = 4
Number of Sets = 128
ECX Index 2
Type: Unified Cache
Level = 3
Self Initializing
Is Not Fully Associatve
Max Threads = 2
System Line Size = 64
Physical Line Partions = 1
Ways of Associativity = 8
Number of Sets = 512
ECX Index 3
Type: Unified Cache
Level = 4
Self Initializing
Is Not Fully Associatve
Max Threads = 32
System Line Size = 64
Physical Line Partions = 1
Ways of Associativity = 16
Number of Sets = 12288
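For reference, the two counts in this output come from bit fields that, per Intel's CPUID documentation, give the maximum number of addressable IDs in a package rather than the number of processors actually present: leaf 1 EBX[23:16] for logical processors and leaf 4 EAX[31:26] + 1 for cores. A minimal sketch that decodes the raw register values listed above (the function names are illustrative):

```cpp
#include <cstdint>

// CPUID.1:EBX[23:16]: maximum number of addressable logical-processor IDs
// per package (an upper bound, not a count of enabled processors).
constexpr unsigned maxLogicalIdsPerPackage(std::uint32_t leaf1Ebx)
{
    return (leaf1Ebx >> 16) & 0xFF;
}

// CPUID.4:EAX[31:26] + 1: maximum number of addressable core IDs per package.
constexpr unsigned maxCoreIdsPerPackage(std::uint32_t leaf4Eax)
{
    return ((leaf4Eax >> 26) & 0x3F) + 1;
}

// CPUID.1:EBX[15:8] * 8: CLFLUSH line size in bytes.
constexpr unsigned clflushLineBytes(std::uint32_t leaf1Ebx)
{
    return ((leaf1Ebx >> 8) & 0xFF) * 8;
}
```

With the values from the dump (InfoType 1 CPUInfo[1] = 0x200800, InfoType 4 CPUInfo[0] = 0x3c004121) these return 32, 16 and 64, matching the "Logical Processor Count", "Number of Cores" and "CLFLUSH cache line size" lines. Since both fields describe the size of an ID space, that would also be consistent with the 32 not changing when Hyper-Threading is disabled.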
Jim Dempsey
Tim,
If there are 4 paths to L3 for 6 cores, then are there 4 L2 caches or 6?
The various docs and charts on Intel.com aren't quite clear on this.
In some places a total of 1MB of L2 cache is stated; in others, 256KB of L2 per core.
The former would indicate that two of the L2s are each shared by 2 cores (similar to some of your other processors without L3).
If there are 6 L2s, then I would imagine some bits in cpuid/cpuidex would indicate sharing of some sort of MUX between L2 and L3.
Jim Dempsey
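The CPUID leaf 4 listing in the first post may already hint at an answer. The unified cache at ECX index 2 reports Max Threads = 2, which would be consistent with one 256 KB L2 per core rather than an L2 shared between cores. (The tool's "Level" values appear to be printed one too high: the raw leaf 4 EAX value 0x3c004121 encodes level 1 for the data cache that the tool lists as Level 2.) A minimal sketch of the arithmetic, with illustrative function names:

```cpp
#include <cstdint>

// Cache size in bytes from the four leaf 4 geometry values as printed by the
// tool (already incremented by one from the raw register fields).
constexpr std::uint64_t cacheBytes(unsigned ways, unsigned partitions,
                                   unsigned lineBytes, unsigned sets)
{
    return static_cast<std::uint64_t>(ways) * partitions * lineBytes * sets;
}

// Cache level from leaf 4 EAX[7:5] (1 = L1, 2 = L2, 3 = L3).
constexpr unsigned cacheLevel(std::uint32_t leaf4Eax)
{
    return (leaf4Eax >> 5) & 0x7;
}

// Maximum number of logical-processor IDs sharing the cache: EAX[25:14] + 1.
constexpr unsigned maxThreadsSharing(std::uint32_t leaf4Eax)
{
    return ((leaf4Eax >> 14) & 0xFFF) + 1;
}
```

cacheBytes(8, 1, 64, 512) gives 262144 bytes (256 KB) for ECX index 2, and cacheBytes(16, 1, 64, 12288) gives 12 MB for ECX index 3, which the tool lists with Max Threads = 32, that is, shared across the whole package.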
On an 8-core processor, would this mean there is a 2-way followed by a 4-way?
In other words, four pairs of cores, each pair sharing one of four paths into L3. That would seem to make sense.
This might seem to indicate that on a 6-core system, 2 of the cores experience less latency to the L3 cache.
Back in the 1980s I helped (in a very small way) design a shared memory system for a cluster in which the selector used a rotating priority. The switch would be free running until a core (a processor in this case) indicated it wanted access. The first (next) core encountered in the current rotating sequence with an access request would gain control of the switch. Subsequent accesses to the shared memory would remain locked to that core (processor) until the core's microcode released the switch. With this scheme a core (processor) could perform multiple accesses through the switch while incurring the latency overhead only once. You could alternatively set the switch to a fixed priority (like a SCSI bus).
The reason I bring this up is that multiple tiering could see a similar benefit if the switching logic were somewhat sticky.
Jim
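A minimal software model of the rotating-priority, sticky switch described above may make the idea easier to follow. The class name StickyArbiter and its interface are invented for illustration; this is not a description of the original hardware or of any Intel design.

```cpp
#include <cstddef>
#include <vector>

// Software model of a rotating-priority selector with a sticky grant.
class StickyArbiter {
public:
    explicit StickyArbiter(std::size_t ports) : request_(ports, false) {}

    void request(std::size_t port) { request_[port] = true; }

    // The owner releases the switch; the rotating scan resumes at the
    // port after the one that just held it.
    void release()
    {
        if (owner_ < 0) return;
        request_[static_cast<std::size_t>(owner_)] = false;
        next_ = (static_cast<std::size_t>(owner_) + 1) % request_.size();
        owner_ = -1;
    }

    // Returns the port holding the switch this cycle, or -1 if idle.
    // A port that owns the switch keeps it (sticky) until release().
    int grant()
    {
        if (owner_ >= 0) return owner_;
        for (std::size_t i = 0; i < request_.size(); ++i) {
            std::size_t p = (next_ + i) % request_.size();
            if (request_[p]) { owner_ = static_cast<int>(p); return owner_; }
        }
        return -1;
    }

private:
    std::vector<bool> request_;
    std::size_t next_ = 0;   // where the rotating scan starts
    int owner_ = -1;         // current holder, -1 while free running
};
```

The owner pays the arbitration cost once and can then issue any number of accesses before calling release(), while the rotating start position keeps any one port from being permanently favoured.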
Maybe you can help me a bit more.
After setting the thread affinity mask, I get the APIC ID in order to retrieve the logical and physical ID of the logical processor.
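For comparison, here is a sketch of the usual legacy way to split an initial APIC ID into SMT, core and package fields: the field widths are derived from the maximum addressable ID counts in CPUID leaf 1 and leaf 4, rounded up to powers of two. The struct and function names are illustrative, and this is not necessarily what the routine used for the tables below does.

```cpp
#include <cstdint>

struct ApicFields {
    unsigned smt;      // logical processor within the core
    unsigned core;     // core within the package
    unsigned package;  // physical package
};

// Smallest number of bits that can represent `count` distinct IDs.
inline unsigned bitsFor(unsigned count)
{
    unsigned bits = 0;
    while ((1u << bits) < count) ++bits;
    return bits;
}

// maxLogicalIds: CPUID.1:EBX[23:16]; maxCoreIds: CPUID.4:EAX[31:26] + 1.
inline ApicFields splitApicId(std::uint32_t apicId,
                              unsigned maxLogicalIds, unsigned maxCoreIds)
{
    const unsigned packageShift = bitsFor(maxLogicalIds);
    const unsigned coreBits = bitsFor(maxCoreIds);
    const unsigned smtBits = packageShift > coreBits ? packageShift - coreBits : 0;
    return ApicFields{
        apicId & ((1u << smtBits) - 1),
        (apicId >> smtBits) & ((1u << coreBits) - 1),
        apicId >> packageShift
    };
}
```

With the X5660 figures from the first post (32 logical IDs, 16 core IDs), APIC ID 0x21 decodes to package 1, core 0, SMT 1. The package shift is 5 bits for these values but would be 4 bits for a processor reporting 16 logical IDs, so it has to be computed per processor rather than hard-coded.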
On the Intel Xeon E5520 @ 2.227 GHz (as reported by CPU-Z)
Family: 6
Model: A
Stepping: 5
Ext. Family: 6
Ext. Model: 1A
Revision: D0
I get pairs like this:
Affinity mask | Logical ID | Physical ID
2^0           | 0          | 0
...           | ...        | 0
2^7           | 7          | 0
2^8           | 0          | 1
...           | ...        | 1
2^15          | 7          | 1
On the Xeon X5660 (Westmere-EP):
Family: 6
Model: C
Stepping: 1
Ext. Family: 6
Ext. Model: 2C
Revision: B0
Affinity mask | Logical ID | Physical ID
2^0           | 0          | 0
...           | ...        | 0
2^5           | 5          | 0
2^6           | 0          | 2
...           | ...        | 2
2^11          | 5          | 2
              | 0          | 4
...           | ...        | 4
              | 5          | 4
              | 0          | 6
...           | ...        | 6
2^23          | 5          | 6
From the table, it is as if there are 4 different Physical IDs.
But, even for the E5520, the values that I get are not what I would have expected.
For the E5520, I would have expected this (I am not sure about the Logical ID, but since setting the affinity to 2^4 makes the thread run on the next physical CPU, I expected the Physical ID to be 1):
Affinity mask | Logical ID | Physical ID
2^0           | 0?         | 0
2^1           | 2?         | 0
2^2           | 4?         | 0
2^3           | 6?         | 0
2^4           | 0?         | 1
2^5           | 2?         | 1
2^6           | 4?         | 1
2^7           | 6?         | 1
2^8           | 1?         | 0
2^9           | 3?         | 0
2^10          | 5?         | 0
2^11          | 7?         | 0
2^12          | 1?         | 1
2^13          | 3?         | 1
2^14          | 5?         | 1
2^15          | 7?         | 1
Thanks for the help!
1) All logical IDs in logical ID order, per physical ID in physical ID order.
2) Lowest logical ID per set of HT siblings, in logical ID order, per physical ID in physical ID order; followed by the next higher logical ID per set of HT siblings, in the same order; followed by the next higher ID per set of HT siblings (assuming more than 2 HT per core). This appears to be your setting.
3) Lowest logical ID per physical ID in physical ID order, followed by the next higher logical ID per physical ID in physical ID order, ... (without regard to HT siblings).
4) Same as 3) above, but sequencing one thread per set of HT siblings.
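To make orderings 1) and 2) concrete, here is a small sketch that generates the (package, core, HT sibling) triple for each affinity bit position under each ordering. The Slot struct and function names are illustrative, and a uniform topology is assumed.

```cpp
#include <vector>

struct Slot { int package, core, thread; };

// Ordering 1: HT siblings adjacent, core by core, package by package.
std::vector<Slot> siblingsAdjacent(int packages, int cores, int threads)
{
    std::vector<Slot> order;
    for (int p = 0; p < packages; ++p)
        for (int c = 0; c < cores; ++c)
            for (int t = 0; t < threads; ++t)
                order.push_back({p, c, t});
    return order;
}

// Ordering 2: the first HT sibling of every core (package by package), then
// the second sibling of every core, and so on.
std::vector<Slot> siblingsLast(int packages, int cores, int threads)
{
    std::vector<Slot> order;
    for (int t = 0; t < threads; ++t)
        for (int p = 0; p < packages; ++p)
            for (int c = 0; c < cores; ++c)
                order.push_back({p, c, t});
    return order;
}
```

siblingsLast(2, 4, 2) puts package 1, core 0 at bit 4 and the second sibling of package 0, core 0 at bit 8, which matches the E5520 behaviour described earlier in the thread (masks 16 to 128 land on the next 4 cores).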
The X5660 report indicates that each package consumes two physical IDs (one of the CPUID leaves should indicate the number of physical IDs per package).
In the QuickThread threading toolkit that I wrote (www.quickthreadprogramming.com) I perform affinity pinning by system logical processor number (affinity bit mask order), then I use CPUID and CPUIDEX to build a proximity bitmask table per thread, per cache level, per NUMA node, so that the threading scheduler is aware of localities amongst available threads. The user application can then specify enqueuing to:
self
within L1 distance (HT siblings)
within L2 distance
within L3 distance (usually equivalent to socket)
within same NUMA node
within one hop NUMA distance
within two hops NUMA distance
within three hops NUMA distance
The above is an inclusion distance from the thread issuing the enqueue.
There are additional control bits for exclusion, dispersion and availability.
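A rough sketch of what a proximity table of this kind could look like. This is not QuickThread's implementation; the types and names are invented. For each logical processor it stores a bitmask of the logical processors in the same core (HT siblings) and in the same package (assumed here to equal L3 distance).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Topo { int package, core; };                       // per logical processor
struct Proximity { std::uint64_t sameCore, samePackage; };

// Index i of `cpus` is the affinity bit position of logical processor i
// (at most 64 logical processors in this sketch).
std::vector<Proximity> buildProximity(const std::vector<Topo>& cpus)
{
    std::vector<Proximity> table(cpus.size(), Proximity{0, 0});
    for (std::size_t i = 0; i < cpus.size(); ++i) {
        for (std::size_t j = 0; j < cpus.size(); ++j) {
            if (cpus[i].package != cpus[j].package) continue;
            table[i].samePackage |= std::uint64_t{1} << j;
            if (cpus[i].core == cpus[j].core)
                table[i].sameCore |= std::uint64_t{1} << j;
        }
    }
    return table;
}
```

A scheduler could then restrict an enqueue to, say, table[self].samePackage to stay within L3 distance of the issuing thread, or to its complement to disperse work across sockets.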
This all sounds complicated but it is relatively trivial to use and implement internally.
// slice rows by sockets
parallel_for(OneEach_L3$, doRows, 0, nRows, nCols, A, B, C);
...
void doRows(int RowN, int RowM, int nCols, double* A[], double* B[], double* C[])
{
    // slice this block of rows by threads within the socket
    parallel_for(L3$, doTile, 0, nCols, RowN, RowM, A, B, C);
}
...
void doTile(int ColN, int ColM, int RowN, int RowM, double* A[], double* B[], double* C[])
{
    // process one tile: rows [RowN, RowM) x columns [ColN, ColM)
    for (int Row = RowN; Row < RowM; ++Row)
        for (int Col = ColN; Col < ColM; ++Col)
            C[Row][Col] = doSomething(A[Row][Col], B[Row][Col]);
}
You can alternately use lambda functions if you want or traditional functions as above.
Jim Dempsey
- CCK