Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Very low L3 cache hits on Sandy Bridge according to PCM

matt_garman
Beginner

We develop ultra-low-latency proprietary software. We are struggling with the transition from Westmere to Sandy Bridge, as our programs seem to run 10-30% slower on an SNB E5-2690 versus a Westmere X5690. To investigate this, we integrated PCM v2.2a into our code (downloaded from [1], although as of now it looks like v2.3 is available). We ran two parallel instances of our program suite, one on Westmere and one on SNB.

What we saw was an average L3 hit ratio of about 73% for Westmere, but only about 17% for Sandy Bridge.  If anything, we expected that the L3 hit ratio should be much higher on SNB, because the cache is significantly larger (20MB versus 12MB).  We suspect that this is why we are seeing such a big performance discrepancy between Westmere and SNB.

So it appears that either (1) something is churning through the CPU, killing our cache, or (2) we have missed disabling some power-saving feature.

We are running CentOS (RedHat) 5.7. Our kernel command line looks like this: "intel_idle.max_cstate=0 selinux=0 idle=poll nox2apic intremap=off processor.max_cstate=0 nohalt isolcpus=2-15". (We isolate CPUs from the scheduler so that we can programmatically pin the most latency-sensitive threads to individual cores.) We have disabled just about all system services (and in any case, the set of running services is the same on both the Westmere and SNB systems). At the BIOS level, we have disabled all the power-saving features we possibly can, including C-states and C1E. (The SNB machine is a Dell R620, and we followed Dell's low-latency tuning guide [2].)

Anyone have any thoughts on what might be causing such low L3 cache utilization?

[1] http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization

[2] http://www.dell.com/us/enterprise/p/d/shared-content~data-sheets~en/Documents~configuring-low-latency-environments-on-dell-poweredge-12g-servers.pdf.aspx

Roman_D_Intel
Employee
Hi, what about the absolute number of L3 cache misses on Westmere vs. Sandy Bridge? Is your application single-threaded or multithreaded? It would help if you could post the output of pcm.x from both systems here. Roman
matt_garman
Beginner
Our program is multithreaded, but only one thread is performance critical. So what we do is use the "isolcpus" kernel parameter to isolate cores from the kernel process scheduler, launch the application as normal, and programmatically pin the performance-critical thread to one of the isolated cores (pthread_setaffinity_np()).

Because we are only interested in the performance of that one thread (and not the program as a whole), we are not using pcm.x. Instead, we modified our code to collect PCM stats on just that one thread. We didn't collect absolute cache misses, only hit ratios and the ratios of CPU cycles lost. The way we are looking at it, we assumed that a low hit ratio would imply a high number of absolute misses.

The programs and OS config (save the kernel command line) are identical. Is it incorrect to assume the caching behavior should be similar on the two machines? If so, why?

I'll add that, along with the lousy 17% L3 hit ratio, we also see nearly 40% of CPU cycles lost due to L3 cache misses. Versus Westmere, where we had a 73% L3 hit ratio and only 8% of CPU cycles lost due to L3 misses.

We can further modify our code to include the absolute number of cache misses, but in the meantime, I'm interested in knowing what additional information that will provide. (Not trying to be argumentative, these are honest questions!) Thanks!
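
For reference, the pinning boils down to roughly this sketch (the core number is just an example from our isolcpus range, and the error handling is simplified):

    // Minimal sketch: pin the calling thread to one isolated core.
    // Core 2 is an example (the first core in isolcpus=2-15).
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    bool pin_current_thread_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0) {                     // returns an errno value on failure
            std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
            return false;
        }
        return true;
    }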
matt_garman
Beginner
Hi, we modified our program to collect additional PCM stats. We don't use pcm.x because we are only concerned with the performance of one thread (which is running on an isolated core). Here is one run of SNB versus WSM:

Sandy Bridge:
    EXEC: 0.7332  IPC: 0.6451  FREQ: 1.1366  AFREQ: 1.1379
    L3MISS: 89948  L2MISS: 164103  L3HIT: 0.4800  L2HIT: 0.6134
    L3CLK: 0.1474  L2CLK: 0.0327

Westmere:
    EXEC: 0.5902  IPC: 0.5594  FREQ: 1.0550  AFREQ: 1.0555
    L3MISS: 54319  L2MISS: 253901  L3HIT: 0.7625  L2HIT: 0.5025
    L3CLK: 0.0755  L2CLK: 0.0591

Note that this is for 1020 iterations of this thread: the same exact binary, same operating system, same CPU isolation scheme, same input data. The only difference is the CPU architecture.

Sandy Bridge had a 48% L3 hit ratio, 15% of CPU cycles lost due to L3 misses, and 90k L3 misses. Westmere had a 75% L3 hit ratio, 8% of CPU cycles lost due to L3 misses, and only 54k L3 misses.

Thanks! Matt
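
For reference, the per-thread collection is roughly the following sketch against the PCM 2.x C++ API (cpucounters.h); run_critical_path() stands in for our actual work, the core number is a placeholder, and the accessor names may differ slightly between PCM versions:

    // Sketch: sample core-level PMU counters around one iteration of the
    // latency-critical code, using the Intel PCM library (cpucounters.h).
    #include "cpucounters.h"

    extern void run_critical_path();   // placeholder for the real work

    void measure_one_iteration(int pinned_core)
    {
        PCM *m = PCM::getInstance();
        if (m->program() != PCM::Success)   // in practice, program the PMU once at startup
            return;

        CoreCounterState before = getCoreCounterState(pinned_core);
        run_critical_path();
        CoreCounterState after  = getCoreCounterState(pinned_core);

        double exec   = getExecUsage(before, after);                  // EXEC
        double ipc    = getIPC(before, after);                        // IPC
        uint64 l3miss = getL3CacheMisses(before, after);              // L3MISS
        double l3hit  = getL3CacheHitRatio(before, after);            // L3HIT
        double l3clk  = getCyclesLostDueL3CacheMisses(before, after); // L3CLK
        // ... accumulate these into the per-run statistics ...
        (void)exec; (void)ipc; (void)l3miss; (void)l3hit; (void)l3clk;
    }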
matt_garman
Beginner
By the way, one idea was that perhaps SNB's fancy new pre-fetching was simply not working for our particular code. The BIOS for my Sandy Bridge machine (Dell R620) has four options for pre-fetching:

    Adjacent Cache Line Prefetch
    Hardware Prefetcher
    DCU Streamer Prefetcher
    DCU IP Prefetcher

I repeated my test for every possible combination of these settings enabled/disabled (16 total combinations). With any one option disabled, the L3 hit ratios actually got worse still.
Roman_D_Intel
Employee
Matt, thank you for sharing the metrics. Judging from the absolute number of cache misses, the measured intervals seem to be very short (so they can contain some noise and measurement overhead) - 10-100 ms? Nevertheless, let's do some analysis. We can compute the instruction throughput for both systems by multiplying the nominal frequency by EXEC:

    Westmere: 3.46 x 0.59 = 2.04 G instructions/second
    SNB: 2.9 x 0.73 = 2.12 G instructions/second

This means SNB executes more instructions per second than Westmere - but your program runs slower on SNB! This makes me think that on SNB you execute a (slightly) different code path than on Westmere. You really need a profiler to see what code is executed and hot, both on Westmere and on SNB, and compare.

Thanks, Roman
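
In code form, that back-of-the-envelope calculation is simply the nominal frequency (G cycles/s) times EXEC (instructions retired per nominal cycle); a trivial sketch using the precise EXEC values from the post above and the parts' rated clocks:

    // Instruction throughput = nominal frequency * EXEC.
    #include <cstdio>

    int main()
    {
        const double wsm = 3.46 * 0.5902;  // Westmere X5690: ~2.04 G instructions/s
        const double snb = 2.90 * 0.7332;  // Xeon E5-2690:   ~2.13 G instructions/s
        std::printf("WSM: %.2f G instr/s, SNB: %.2f G instr/s\n", wsm, snb);
        return 0;
    }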
matt_garman
Beginner
Roman Dementiev (Intel) wrote:

This means SNB executes more instructions per second than Westmere - but your program runs slower on SNB! This makes me think that on SNB you execute a (slightly) different code path than on Westmere. You really need a profiler to see what code is executed and hot, both on Westmere and on SNB, and compare.

This is a relatively short code path that we are studying: maybe a dozen or so screens worth of code. I really don't think an alternate code path is possible, as the program wouldn't be correct.

We have done profiling as you suggest. We put about a dozen rdtscp calls in this code path, stored them in a table, and output them at the end of execution. We were hoping to isolate one section of code that was particularly slow on SNB versus WSM. But it was just a general overall slowness: all of our "stopwatch" points were just a little bit higher on SNB. In other words, the added execution time was, for the most part, evenly distributed.

Any other ideas? Thanks, Matt
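
For reference, the "stopwatch" points are essentially this sketch (the table size and checkpoint indices are illustrative, not our real instrumentation):

    // Sketch: timestamp checkpoints with RDTSCP and dump the deltas afterwards.
    #include <x86intrin.h>   // __rdtscp
    #include <cstdint>
    #include <cstdio>

    static uint64_t g_stamp[16];   // one slot per checkpoint

    static inline void stamp(int slot)
    {
        unsigned int aux;                // receives IA32_TSC_AUX (cpu/node id)
        g_stamp[slot] = __rdtscp(&aux);  // RDTSCP waits for prior instructions to complete
    }

    static void dump_deltas(int used)
    {
        for (int i = 1; i < used; ++i)
            std::printf("segment %d: %llu cycles\n", i,
                        (unsigned long long)(g_stamp[i] - g_stamp[i - 1]));
    }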
Roman_D_Intel
Employee
matt_garman wrote:

This is a relatively short code path that we are studying: maybe a dozen or so screens worth of code. I really don't think an alternate code path is possible, as the program wouldn't be correct.
Does your code execute any spin locks or other thread-synchronization primitives that may run a different number of iterations in their spin loops? Best regards, Roman
matt_garman
Beginner
Roman Dementiev (Intel) wrote:

This is a relatively short code path that we are studying: maybe a dozen or so screens worth of code. I really don't think an alternate code path is possible, as the program wouldn't be correct.

does your code execute any spin locks or other thread synchronization primitives that may have different number of executed iterations in spin loops?

Yes, that is a good point that I didn't originally consider. We use Linux pthreads, which IIRC do use some spin locks "under the covers". Also, this is an event-driven thread: it spins on a SysV message queue waiting for the next event.

Note that I also tested without the busy event waiting: this has virtually no impact on the L3 hit ratio on Westmere, but drops the SNB L3 hit ratio even more.

Thank you, I appreciate your feedback. -Matt
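
For reference, the busy event waiting is roughly this sketch (the queue id, message layout, and sizes are placeholders, not our production code):

    // Sketch: spin on a SysV message queue with IPC_NOWAIT instead of blocking.
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>
    #include <cerrno>

    struct event_msg {
        long mtype;         // required first field for SysV messages
        char payload[256];  // placeholder payload
    };

    // Returns true when an event has been received, false on a real error.
    bool wait_for_event(int qid, event_msg &msg)
    {
        for (;;) {
            ssize_t n = msgrcv(qid, &msg, sizeof(msg.payload), 0, IPC_NOWAIT);
            if (n >= 0)
                return true;           // got an event
            if (errno != ENOMSG)
                return false;          // something is actually wrong
            // queue empty: keep spinning (this is the "busy event waiting")
        }
    }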
Dmitri
Beginner

We seem to be experiencing similar issues. I'm wondering if there's any progress on this subject.

Roman_D_Intel
Employee

Hi Matt,

The just-released Intel PCM 2.35 fixes occasionally wrong cache statistics by applying a special workaround for the Intel Xeon E5 (based on the Intel microarchitecture code-named Sandy Bridge-EP / Sandy Bridge-E). It makes sense to re-measure the cache statistics with this version.

Best regards,

Roman

Pavel_Kogan
Beginner

I think we have the same problem :(

Our old dual Xeon E5645 2.4 GHz (Westmere) performs the same as a new dual Xeon E5-2620 2.0 GHz (Sandy Bridge) when running in a single main thread, but outperforms the E5-2620 by almost a factor of 2 when running with multiple threads. The executable is the same.

Regards, Pavel

James_D_1
Beginner

I observed similar performance degradation (cache misses, running time) when running lock-prefixed primitives frequently with multiple threads. Any progress in this thread, or any suggestions?

Regards,
James

Roman_D_Intel
Employee

James,

Too-frequent locked/atomic operations might be the reason for the lack of scaling. There is a new study discussing such trade-offs on the new processor architectures: "Lock Scaling Analysis on Intel® Xeon® Processors".

Roman
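
As an illustration of this kind of non-scaling: several threads hammering a single cache line with lock-prefixed increments (std::atomic fetch_add compiles to LOCK XADD on x86) scale poorly because the line bounces between cores. A minimal sketch with arbitrary thread and iteration counts:

    // Sketch: contended lock-prefixed increments across threads.
    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    std::atomic<long> shared_counter(0);

    void worker(long iters)
    {
        for (long i = 0; i < iters; ++i)
            shared_counter.fetch_add(1, std::memory_order_relaxed);  // LOCK XADD on x86
    }

    int main()
    {
        const int nthreads = 8;          // arbitrary
        const long iters   = 10000000L;  // arbitrary
        std::vector<std::thread> threads;
        for (int t = 0; t < nthreads; ++t)
            threads.emplace_back(worker, iters);
        for (std::thread &th : threads)
            th.join();
        std::printf("total: %ld\n", shared_counter.load());
        return 0;
    }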

matt_garman
Beginner

I don't know if it's related to locking or not, but for us, the "magic bullet" was a Linux kernel version change. In particular, see this changelog:

    http://rpm.pbone.net/index.php3/stat/22/idpl/16999719/com/changelog.html

I believe the change that helped us is this one:

    [x86_64] Revert ACPI APIC mode test (Prarit Bhargava) [728163 721361]

I actually contacted Prarit and asked him about this. He said that they were trying to use the system's ACPI tables to program the interrupts (APIC). However, many systems have their ACPI tables wrong for whatever reason, which results in incorrect APIC settings. Having the interrupts programmed incorrectly can result in very poor system performance.

Note that this patch improved the performance of both Westmere and Sandy Bridge, but the improvement was much more dramatic on SNB.

-Matt
