Hi all,
After upgrading servers from dual Xeon E5645 2.4GHz (Nehalem) to dual Xeon E5-2620 2.0GHz (Sandy Bridge) I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that summarizes the problem. In general, I have a prebuilt LUT with 3000 int rows, each row containing about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I run it once in the main thread and once in a separate thread (with the main thread waiting). I know there is thread-creation overhead, but I used to think it is under 1 ms. For precise results I average over 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64; my application is also x64. The code was compiled with VC++ 2012 Express. The results are:
Dual Xeon E5645 2.4GHz (Nehalem): Main thread: 340.522 ms, Separate thread: 388.598 ms, Diff: 13%
Dual Xeon E5-2620 2.0GHz (Sandy Bridge): Main thread: 362.515 ms, Separate thread: 565.295 ms, Diff: 36%
Single Xeon E3-1230 V2 3.3GHz (Ivy Bridge): Main thread: 234.928 ms, Separate thread: 267.603 ms, Diff: 13%
My problem is with the 36%. Can anyone explain to me what is wrong with my code? It may not be super optimized, but why does it behave differently on Sandy Bridge?
Many thanks, Pavel.
@Pavel
It is not an easy question to answer. There is also very scarce information about NUMA in the Intel SDM. I'm posting a link to a very interesting discussion about NUMA-related performance; I posted a few links there, one of which gives a detailed explanation of NUMA performance degradation.
Link to the post :://software.intel.com/en-us/forums/topic/346334
Very interesting information regarding NUMA performation degradation link ://communities.vmware.com/thread/391284
@Pavel
I posted a few links to a very interesting discussion, also related to NUMA and performance degradation. Unfortunately my posts are still queued for admin approval, so I'm posting part of my answer from that discussion below.
>>>Probably NUMA-architecture-related memory distances, coupled with the thread being executed on different nodes and forced to access its non-local memory, could be responsible for performance degradation related to memory accesses. When the number of nodes is greater than 1, some performance penalty is to be expected. IIRC the penalty is measured in units of "NUMA distance", with a normalized value of 10: every access to local memory has a cost of 10 (normalized), i.e. 1.0. When the process accesses off-node (remote) memory, from the NUMA "point of view" a penalty is added for the overhead of moving data over the NUMA interlink. Accessing a neighbouring node can add up to 0.4, so the total penalty can reach 1.4. More information can be found in the ACPI documentation.>>>
@Pavel
Can you add NUMA API functions to your test case and test it on both servers?
Just an inquiry... but how many threads are you running? Are you running with hyperthreading? What's the IPC of the threads, and what's the average bytes/instruction in each thread? I wonder whether you're running out of ILD (instruction length decoder) bandwidth on SB. SB is more prone to this than IB. Just a thought.
perfwise
@perfwise
Can pressure building up on one of the execution ports trigger rescheduling of the threads, moving them further away in the NUMA space?
NUMA-related information is also located in the KPRCB structure.
@Pavel
Regarding setting processor affinity, you can use the so-called "Interrupt-Affinity Policy Tool", which you can download from the Microsoft website. Bear in mind that those settings relate to interrupt affinity and should be used only when it is known that some driver's ISR is consuming too much processor time.
iliyapolak,
SB differs from IB because if your IPC is high enough and you're not hitting in the DSB, you can become starved for instructions from the front end. That's one of the biggest differences between SB and NH/IB. One can identify whether this is an issue by monitoring the number of uops delivered by the DSB and also the IPC. If you're at 3+ IPC and you're not hitting in the DSB, you may degrade performance. My experience tells me that if you're trying to pull more than 8 bytes per cycle from the ILD, you're not going to be a happy camper on SB, but IB can do so. There must have been some issue that shipped with SB which, while not a functional problem, degraded performance and was fixed in IB. This is a hard issue to identify, and I'm sure most don't know it exists, simply because it's likely rare given the large percentage of time the DSB is delivering uops.
perfwise
Thank you all for the very professional and useful feedback. We have decided to stop the migration to SB until the affinity issue is fixed in our code.
Pavel
@perfwise
What does "DSB" stand for?
Sergey Kostrov wrote:
>>@perfwise
>>
>>What does "DSB" stand for? When somebody uses an abbreviation like DSB and doesn't explain what it means, that is the most unpleasant thing on any forum. Personally, I don't have time to "hunt down" all these unexplained abbreviations on the Internet if I don't know what they mean.
Yes I completely agree with you.
DSB is short for decode stream buffer, which is the uop cache. It's probably googleable. I asked that question on this forum, and there is a topic I started here covering that and many more acronyms that Intel uses.
>>> If you're at 3+ in IPC.. and you're not hitting in the DSB>>>
This is obvious, but SB's sustained rate should be 4 uops per cycle, and even AVX instructions are decoded into single uops. So when you are dealing with code that uses a lot of SSE/AVX instructions, which need fewer than 3 uops to execute, there should not be any starvation.
Excuse me, what is obvious to you?
You don't get uops from the ILD if you can't fetch enough bytes to decode; that seems to make sense to me. This was my original inquiry: how big are the instructions, and what's the IPC? If you haven't run many tests to isolate the theoretical limits of the ILD or DSB on your chips, then don't treat anything as obvious. It is unpleasant, especially for someone like me who has run those tests and knows the differences in capability between the ILD on SB and IB, to have you state this is obvious.
I've determined, in my workloads, for my customers, and through the SPEC suite, what the sources of uops (ILD, DSB, or MS) are on SB and IB. If you've not done that, or you haven't written directed tests to study these issues and collected performance-counter data, please refrain from stating something is obvious.
Perfwise
This sentence is obvious to me: "If you're at 3+ in IPC.. and you're not hitting in the DSB.. you may degrade performance". You probably misunderstood my post.
And it is clear that there is a direct dependency on the front-end x86 instruction-fetch bandwidth.