We develop ultra-low-latency proprietary software. We are struggling with the transition from Westmere to Sandy Bridge (SNB), as our programs seem to run 10-30% slower on an SNB E5-2690 than on a Westmere X5690. To investigate this, we integrated PCM v2.2a into our code (downloaded from , although as of now it looks like v2.3 is available). We ran two parallel instances of our program suite, one on Westmere and one on SNB.
What we saw was an average L3 hit ratio of about 73% for Westmere, but only about 17% for Sandy Bridge. If anything, we expected that the L3 hit ratio should be much higher on SNB, because the cache is significantly larger (20MB versus 12MB). We suspect that this is why we are seeing such a big performance discrepancy between Westmere and SNB.
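For readers comparing numbers, here is a minimal sketch of what a hit ratio like the one quoted above means, assuming it is defined as hits divided by hits plus misses. The function name and the counter values are hypothetical and chosen only to reproduce the reported percentages; PCM derives its figures from hardware counters, which is not shown here.

```python
def l3_hit_ratio(l3_hits: int, l3_misses: int) -> float:
    """Fraction of L3 lookups that hit: hits / (hits + misses)."""
    lookups = l3_hits + l3_misses
    if lookups == 0:
        return 0.0  # no L3 traffic in the interval; avoid dividing by zero
    return l3_hits / lookups


# Hypothetical counter values, picked only to match the percentages in the post.
westmere = l3_hit_ratio(l3_hits=730, l3_misses=270)
sandy_bridge = l3_hit_ratio(l3_hits=170, l3_misses=830)
print(f"Westmere {westmere:.0%}, Sandy Bridge {sandy_bridge:.0%}")
```

Note that a ratio alone says nothing about how many lookups reached L3 in the first place, so the absolute hit and miss counts are worth comparing as well.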
So it appears that either (1) something is churning through the CPU, killing our cache, or (2) we have missed disabling some power-saving feature.
We are running CentOS (RedHat) v5.7. Our kernel command line looks like this: "intel_idle.max_cstate=0 selinux=0 idle=poll nox2apic intremap=off processor.max_cstate=0 nohalt isolcpus=2-15". (We isolate CPUs from the scheduler so that we can programmatically pin the most latency-sensitive threads to individual cores.) We have disabled just about all system services (at least the running services are the same on both the Westmere and SNB systems). At the BIOS level, we have disabled all the power-saving features we possibly can (including C-states and C1E). (The SNB machine is a Dell R620, and we followed Dell's low latency tuning guide.)
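As an illustration of the isolate-and-pin setup described above, here is a rough Python sketch. It is not the poster's code (which is proprietary and presumably native); `parse_isolcpus` and `pin_to_cpu` are hypothetical helper names, and the parser assumes the plain list form of `isolcpus=` shown in the command line above. The pinning call is Linux-only.

```python
import os


def parse_isolcpus(cmdline: str) -> set:
    """Return the CPU ids named by an isolcpus= kernel parameter.

    Handles only the plain list form (e.g. "2-15" or "1,3,5-6").
    """
    cpus = set()
    for token in cmdline.split():
        if not token.startswith("isolcpus="):
            continue
        for part in token[len("isolcpus="):].split(","):
            if "-" in part:
                first, last = part.split("-", 1)
                cpus.update(range(int(first), int(last) + 1))
            elif part:
                cpus.add(int(part))
    return cpus


def pin_to_cpu(cpu: int) -> set:
    """Pin the caller to a single CPU and return the resulting affinity set."""
    os.sched_setaffinity(0, {cpu})  # 0 means "the caller"
    return set(os.sched_getaffinity(0))


isolated = parse_isolcpus("idle=poll nohalt isolcpus=2-15")
print(sorted(isolated))
```

On a live system the command line can be read from /proc/cmdline, and a latency-sensitive thread would call `pin_to_cpu` with one of the isolated ids before entering its event loop.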
Anyone have any thoughts on what might be causing such low L3 cache utilization?
Roman Dementiev (Intel) wrote: "This means SNB executes more instructions per second than the Westmere. But your program runs slower on SNB! This makes me think that on SNB you execute a (slightly) different code path compared to Westmere. You really need a profiler to see what code is executed and hot on both Westmere and SNB, and compare."
This is a relatively short code path that we are studying: maybe a dozen or so screens' worth of code. I really don't think an alternate code path is possible, as the program wouldn't be correct. We have done profiling as you suggest. We put about a dozen rdtscp calls in this code path, stored them in a table, and output them at the end of execution. We were hoping to isolate one section of code that was particularly slow on SNB versus WSM. But it was just a general overall slowness. All of our "stopwatch" points were just a little bit higher on SNB. In other words, the added execution time was for the most part evenly distributed. Any other ideas? Thanks, Matt
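The stopwatch-table approach Matt describes (timestamps at fixed points, compared segment by segment between the two machines) can be sketched roughly as follows. This is only an illustration: Python has no rdtscp, so `time.perf_counter_ns` stands in for it, and the class and function names are hypothetical.

```python
import time


class Stopwatch:
    """Collect labelled timestamps in a table for output after the run."""

    def __init__(self):
        self.labels = []
        self.stamps = []

    def mark(self, label):
        # Stand-in for an rdtscp read at a fixed point in the code path.
        self.stamps.append(time.perf_counter_ns())
        self.labels.append(label)


def segment_deltas(stamps):
    """Elapsed time between each pair of consecutive stopwatch points."""
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


def slowdown_per_segment(baseline, candidate):
    """Ratio candidate/baseline for each segment; inf if the baseline is 0."""
    pairs = zip(segment_deltas(baseline), segment_deltas(candidate))
    return [c / b if b else float("inf") for b, c in pairs]
```

If one segment shows a ratio well above the others, the slowdown is localized there; ratios that are all roughly equal correspond to the evenly distributed slowness reported in this thread.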
Matt wrote: "This is a relatively short code path that we are studying: maybe a dozen or so screens worth of code. I really don't think an alternate code path is possible, as the program wouldn't be correct."
Does your code execute any spin locks or other thread synchronization primitives that may have a different number of executed iterations in spin loops? Best regards, Roman
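To make the spin-loop point concrete: the number of iterations a busy-wait executes depends on how long the awaited event takes, not on the code itself, so two runs of the same binary can retire very different instruction counts. A small sketch under that assumption (the helper names are hypothetical, and Python threads only approximate what a native spin lock does):

```python
import threading


def spin_until_set(flag: threading.Event) -> int:
    """Busy-wait on a flag and return how many loop iterations were executed."""
    iterations = 0
    while not flag.is_set():
        iterations += 1
    return iterations


def measure_spin(delay_s: float) -> int:
    """Spin until a timer thread sets the flag after roughly delay_s seconds."""
    flag = threading.Event()
    timer = threading.Timer(delay_s, flag.set)
    timer.start()
    count = spin_until_set(flag)
    timer.join()
    return count


print(measure_spin(0.01), measure_spin(0.05))
```

The counts vary from run to run and from machine to machine, which is why a raw instructions-per-second figure can rise while useful work per second falls.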
Roman Dementiev (Intel) wrote: "Does your code execute any spin locks or other thread synchronization primitives that may have a different number of executed iterations in spin loops?"
Yes, that is a good point that I didn't originally consider. We use Linux pthreads, which IIRC do use some spin locks "under the covers". Also: this is an event-driven thread. That thread is spinning on a SysV message queue waiting for the next event. Note that I also tested without the busy event waiting: this has virtually no impact on the L3 hit ratio of Westmere, but drops the SNB L3 hit ratio even more. Thank you, I appreciate your feedback. -Matt
The just-released Intel PCM 2.35 fixes occasionally incorrect cache statistics by applying a special workaround for Intel Xeon E5 (based on Intel microarchitecture codenamed Sandy Bridge-EP and Sandy Bridge-E). It makes sense to remeasure the cache statistics with this version.
I think we have the same problem :(
An old dual Xeon E5645 2.4GHz (Westmere-EP) performs the same as a new dual Xeon E5-2620 2.0GHz (Sandy Bridge) when running in a single main thread, but outperforms the E5-2620 by almost a factor of 2 when running in multiple threads. The executable is the same.
I observed similar degradation (more cache misses, longer running time) when running frequent lock-prefixed primitives with multiple threads. Has there been any progress in this thread, or does anyone have suggestions?
Overly frequent locked/atomic operations might be the reason for the lack of scaling. There is a new study discussing such trade-offs on new processor architectures: "Lock Scaling Analysis on Intel® Xeon® Processors".
I don't know if it's related to locking or not, but for us, the "magic bullet" was a Linux kernel version change. In particular, see this changelog:
I believe the change that helped us is this one:
[x86_64] Revert ACPI APIC mode test (Prarit Bhargava) [728163 721361]
I actually contacted Prarit and asked him about this. He said that they were trying to use system ACPI tables to program interrupts (APIC). However, many systems have their ACPI tables wrong for whatever reason, which resulted in incorrect APIC settings. Having the interrupts programmed incorrectly can result in very poor system performance.
Note that this patch improved the performance of both Westmere and Sandy Bridge, but the improvement was much more dramatic on SNB.
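For anyone wanting to check whether interrupts are landing where they expect, one rough diagnostic is to total the per-CPU columns of /proc/interrupts. The sketch below is only an assumption about how one might do that, not something used in this thread; `interrupts_per_cpu` is a hypothetical helper and the sample text is made up for illustration.

```python
def interrupts_per_cpu(text: str) -> list:
    """Sum interrupt counts per CPU from /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        _, _, rest = line.partition(":")
        fields = rest.split()[:ncpu]
        # Skip rows without a full set of per-CPU counters (e.g. ERR, MIS).
        if len(fields) != ncpu or not all(f.isdigit() for f in fields):
            continue
        for cpu, field in enumerate(fields):
            totals[cpu] += int(field)
    return totals


# Made-up sample in the /proc/interrupts layout, for illustration only.
SAMPLE = """
           CPU0       CPU1
  0:        100          0   IO-APIC-edge      timer
 24:          5        995   PCI-MSI-edge      eth0
ERR:          0
"""
print(interrupts_per_cpu(SAMPLE))
```

On a real machine the input would come from reading /proc/interrupts; taking two snapshots and subtracting them shows where interrupts arrived during the interval, which can be compared against the isolated cores.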