Poor cache performance on Tigerton quad-core

Ron_B_ · ‎11-02-2007

For a large, memory / compute intensive application, we are concerned about possible FSB contention for Tigerton in the quad-core configuration. Although most test cases ran well in quad-core, one case with high cache demand saw an almost 2X increase in run time compared to Tigerton in a dual-core configuration (using Linux taskset to specify either 4 CPUs or 2 CPUs per physical socket).

The slow test case saw CPI double and MEM_LOAD_RETIRED.L2_LINE_MISS % double for the application as a whole. Individual functions had CPI as high as 52 and L2_LINE_MISS as high as 25%.

The question is, "What additional Tigerton Core events can be used to debug this situation?"

In attempting to understand the impact of FSB contention, we tried monitoring L2_REJECT_BUSQ.BOTH_CORES.ANY.MESI %, but this event does not correlate with L2_LINE_MISS.

Suggestions?

Also, cpuinfo reports 4096 KB cache. Is this shared between all 4 cores? Is the available cache per core double for the 2 CPU configuration?

Thanks,
Ron Bennett
ron_bennett@mentor.com

TimP · ‎11-03-2007

There is 4096KB L2 cache per pair of cores. One of the options with taskset would be to use one core for each L2 cache. Cache eviction events might be relevant, to confirm the apparent diagnosis that this application suffers from contention between cores on the same cache.

Anat_S_Intel · ‎11-05-2007

If all the threads read the same data, then when running with 4 cores the same data has to be brought in twice, once into each of the two last-level caches.

You can also check whether the threads suffer from false sharing of data. EXT_SNOOP.THIS_AGENT.HITM / INST_RETIRED.ANY should be below 0.005; otherwise the test case suffers greatly from false sharing.
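As a rough illustration of that check, here is a minimal Python sketch that computes the ratio from two raw event counts and compares it with the 0.005 threshold. The function names and the sample counts are hypothetical; only the event names and the threshold come from the post above.

```python
# Threshold quoted in the post above: EXT_SNOOP.THIS_AGENT.HITM per retired instruction.
FALSE_SHARING_THRESHOLD = 0.005


def hitm_ratio(ext_snoop_hitm: int, inst_retired: int) -> float:
    """Return EXT_SNOOP.THIS_AGENT.HITM / INST_RETIRED.ANY for one sampling run."""
    if inst_retired <= 0:
        raise ValueError("inst_retired must be positive")
    return ext_snoop_hitm / inst_retired


def suggests_false_sharing(ext_snoop_hitm: int, inst_retired: int,
                           threshold: float = FALSE_SHARING_THRESHOLD) -> bool:
    """True when the ratio is at or above the threshold (i.e. not 'below 0.005')."""
    return hitm_ratio(ext_snoop_hitm, inst_retired) >= threshold


# Hypothetical counts, for illustration only.
print(suggests_false_sharing(6_000_000, 1_000_000_000))  # ratio 0.006
print(suggests_false_sharing(1_000_000, 1_000_000_000))  # ratio 0.001
```

The counts would come from whatever the profiler reports for the same run; both events need to cover the same time window for the ratio to mean anything.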

Ron_B_ · ‎11-07-2007

Tim,

Thanks for the info. It's not clear from /proc/cpuinfo on this system (see below) how the caches are shared. Can you point me to documentation on the Tigerton cache architecture and how this relates to processor number? Or I could just try different combinations...

Also, I only remember seeing single L2 miss events in VTune. Are these cumulative for all L2 caches? Can I get separate event data for each L2 cache?

Thanks,
Ron

/proc/cpuinfo (16 cores)

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5857.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.91
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 2
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.93
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 2
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.95
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 4
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.94
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 4
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.97
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 6
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.96
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 6
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.95
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 8
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.95
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 9
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5852.01
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 10
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 2
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.95
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 11
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 2
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.90
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 12
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 4
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.95
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 13
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 4
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.95
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 14
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 6
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.95
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel CPU @ 2.93GHz
stepping : 11
cpu MHz : 2925.871
cache size : 4096 KB
physical id : 6
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca lahf_lm
bogomips : 5851.91
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
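
A dump this long is easier to read once it is grouped by socket. The following Python sketch parses cpuinfo-style text and lists, per "physical id", the (processor, core id) pairs found. The function names are hypothetical, and SAMPLE is an abbreviated excerpt of the dump above (processors 0, 1, 2 and 8 only). Note that these fields do not say which cores share an L2; on kernels that expose it, /sys/devices/system/cpu/cpuN/cache/index2/shared_cpu_list may answer that directly.

```python
from collections import defaultdict

# Abbreviated excerpt of the dump above; only the fields used here are kept.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 2

processor : 2
physical id : 2
core id : 0

processor : 8
physical id : 0
core id : 1
"""


def parse_cpuinfo(text):
    """Return one dict per blank-line-separated stanza of 'key : value' lines."""
    cpus, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def group_by_socket(cpus):
    """Map physical id -> sorted list of (processor, core id) tuples."""
    sockets = defaultdict(list)
    for cpu in cpus:
        sockets[int(cpu["physical id"])].append(
            (int(cpu["processor"]), int(cpu["core id"])))
    return {sock: sorted(members) for sock, members in sorted(sockets.items())}


for sock, members in group_by_socket(parse_cpuinfo(SAMPLE)).items():
    print(f"physical id {sock}: {members}")
```

On a live system the same functions could be fed the contents of /proc/cpuinfo instead of SAMPLE, assuming the "physical id" and "core id" fields are present.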

TimP · ‎11-07-2007

The SELF events refer to a given core. I'm away from the office, so I don't have access to a machine where I could browse the available events, nor would I care to pretend to be an expert on them. There is an ambiguity in the assignment of BIOS-numbered cores to caches, which is supposed to be resolved by additional data presented in /proc/sysinfo or such in the latest distros (Red Hat 5, ...). However, in my RH5 installations, those entries have been empty.
Presumably, if you found that taskset -c 0-7 gave good performance when you need maximum cache per thread, you could infer that you have affinitized one thread per cache. It's common (but not guaranteed) on Xeon platforms to set up the BIOS so that the first threads are assigned to different sockets, as your /proc/cpuinfo shows that 0-3 are assigned to different sockets. Following that plan, the next threads (4-7) would be assigned to the other cache on each socket. The most pressing reason for arranging it this way would be to present maximum resources when the OS is booted with a fraction of the total number of cores active.
When you go beyond 1 thread per cache, it is important to assign pairs of threads to each cache so as to maximize sharing of cache lines. For example, this typically means that an OpenMP job must unscramble the BIOS ordering, so that adjacent OpenMP threads are affinitized to the same cache. I don't have experience with Tigerton, but it could mean that taskset -c 0,8,1,9,2,10, ... would be the way to maximize performance with all cores busy. Besides catering to the boundary between data belonging to 2 threads, where they hit the same cache line, it also minimizes the wastage of cache lines brought in by prefetch. Recent linux kernels have done quite well at teaching the scheduler good preferences.
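
To make the ordering idea concrete, here is a small Python sketch that turns a list of cache-sharing CPU pairs into an interleaved CPU list of the form 0,8,1,9,2,10. The pairing itself (0 with 8, 1 with 9, and so on) is an assumption for illustration, not something confirmed for Tigerton here, and the helper names are hypothetical.

```python
def interleave_cache_pairs(pairs):
    """Flatten [(a, b), ...] so CPUs sharing a cache sit next to each other."""
    order = []
    for first, second in pairs:
        order.extend((first, second))
    return order


def cpu_list(order):
    """Format CPU numbers as a comma-separated list."""
    return ",".join(str(cpu) for cpu in order)


# Assumed pairing: CPU n shares an L2 with CPU n + 8 (hypothetical).
pairs = [(0, 8), (1, 9), (2, 10)]
print(cpu_list(interleave_cache_pairs(pairs)))  # 0,8,1,9,2,10
```

Keep in mind that taskset applies one allowed-CPU set to the whole process, so the order of the list matters only where a threading runtime binds thread i to the i-th entry; taskset alone does not place individual threads.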

Ron_B_ · ‎11-08-2007

Thanks. This is quite helpful. For this particular test, there is no shared memory between threads. The characterization is to determine if the cache miss penalty is exacerbated by FSB contention. But, I'm not sure we have the right events to answer that question.

I'll try the taskset configurations tomorrow to see if we can determine the BIOS ordering for this system.

Thanks,
Ron

Ron_B_ · ‎11-12-2007

Tim,

Taskset -c 0,8 showed a significant run time penalty, whereas 0,1 and 0,9 did not. From this, I assume that cores 0 and 8 share a cache, as do cores 1 and 9. The run time penalty is as follows:

2 cores, no shared cache: 1 X
2 cores, shared cache: 1.4 X
4 cores: 1.8 X
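
The same inference can be sketched in a few lines of Python: normalize each measured run time to the fastest configuration and flag the CPU pairs whose penalty crosses a cutoff. The run times in seconds are hypothetical values chosen to match the 1 X and 1.4 X ratios above, and the 1.2 cutoff and function names are assumptions.

```python
def relative_runtimes(runtimes):
    """Scale each run time by the fastest one, so the best configuration is 1.0."""
    baseline = min(runtimes.values())
    return {config: t / baseline for config, t in runtimes.items()}


def likely_shared_cache(runtimes, threshold=1.2):
    """Return the CPU pairs whose relative run time is at or above the cutoff."""
    relative = relative_runtimes(runtimes)
    return sorted(config for config, ratio in relative.items() if ratio >= threshold)


# Hypothetical run times (seconds) matching the 1 X / 1.4 X ratios above.
measured = {(0, 8): 140.0, (0, 1): 100.0, (0, 9): 100.0}
print(likely_shared_cache(measured))  # [(0, 8)]
```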

As I mentioned, most test cases perform fine on the 4 core configuration. This test case is used to stress cache performance, but is an indicator that some applications may run slower on Tigerton quads than on similar dual-core systems.

Assistance with VTune events for further characterization is appreciated.

Thanks,
ron

TimP · ‎11-12-2007

If you use an optimum affinity selection, you should be able to get at least as good performance from quad core as for the same number of threads on multiple socket dual core. Without affinity or a good scheduler in the OS, it could easily be worse. In a case like yours, it should be possible to get improved performance by giving each thread an entire L2 cache.

I was looking at a case today which appears to suffer from cache contention when running 2 threads per cache. The L1 cache eviction rates L1D_M_EVICT did go way up when I added a thread, while performance dropped, and of course the number of UOPS, particularly UOPS_RETIRED.CYCLES_NONE (cycles where no UOP is retired, a sort of stalled cycle count), went up a lot.

I'm at the other extreme, checking the possibility of gaining by threading an application which runs OK without threading on Core 2 Duo.

Ron_B_ · ‎02-21-2008

Tim,

I thought you might like a Harpertown update to this cache performance issue.

First, I wanted to clarify that except for this one test case, which is used to severely stress cache performance, we are seeing excellent per-core performance for the Core-based quads. The per-clock performance is essentially linear for single core, dual core, and quad core configurations, which was not the case for the Netburst-based processors due to FSB contention.

The performance hit on Harpertown is reduced by about 50% from Tigerton. Presumably this is due to a combination of the faster FSB and larger cache on Harpertown. For example, in the dual-core configuration with two threads running on a shared cache, the 40% performance hit on Tigerton is reduced to 20% on Harpertown.

Ron