Hi all,
After upgrading servers from dual Xeon E5645 2.4GHz (Nehalem) to dual Xeon E5-2620 2.0GHz (Sandy Bridge), I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that summarizes the problem. In general, I have a prebuilt LUT with 3000 int rows; each row contains about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I tried it once in the main thread and once in a separate thread (with the main thread waiting). I do know that there is thread-creation overhead, but I used to think it is up to 1ms. For precise results I am averaging over 100 iterations. I tested the same code on three servers running Windows Server 2008 R2 x64; my application is also x64, compiled with VC++ 2012 Express. The results are:
Dual Xeon E5645 2.4GHz (Nehalem): Main thread - 340.522[ms], Separate thread: 388.598[ms] Diff: 13%
Dual Xeon E5-2620 2.0GHz (Sandy Bridge): Main thread - 362.515[ms], Separate thread: 565.295[ms] Diff: 36%
Single Xeon E3-1230 V2 3.3GHz (Ivy Bridge): Main thread - 234.928[ms], Separate thread: 267.603[ms] Diff: 13%
My problem is with the 36%. Can anyone explain to me what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy Bridge?
Many thanks, Pavel.
@Pavel
Can you add NUMA API calls to your test case and run it on both servers?
Just an inquiry... but how many threads are you running? Are you running with Hyper-Threading? What's the IPC of the threads, and what's the average bytes/instruction in each thread? I wonder whether you're running out of ILD (instruction length decoder) bandwidth on SB. SB is more prone to this than IB. Just a thought..
perfwise
@perfwise
Can pressure building up on one of the execution ports trigger rescheduling of the threads and moving them further away in the NUMA space?
Regarding NUMA: related information is also located in the KPRCB structure.
@Pavel
Regarding setting processor affinity, you can use the so-called "Interrupt Affinity Policy Tool". You can download it from the Microsoft website. Bear in mind that those settings are related to interrupt priority and should be used only when it is known that some driver's ISR is consuming too much processor time.
iliyapolak,
SB differs from IB because if your IPC is high enough and you're not hitting in the DSB, then you can become starved for instructions from the front end. That's one of the biggest differences between SB and NH/IB. One can identify whether this is an issue by monitoring the # of uops delivered by the DSB and also the IPC. If you're at 3+ IPC and you're not hitting in the DSB, you may degrade performance. My experience tells me that if you're trying to pull more than 8B per cycle from the ILD then you're not going to be a happy camper on SB, but IB can do so. There must have been some issue which shipped with SB that, while not a functional problem, degraded performance and was fixed in IB. This is a hard-to-identify issue, and I'm sure most don't know it exists, simply because it's likely rare given the large % of time the DSB is delivering uops.
perfwise
Thank you all for the very professional and useful feedback. We have decided to stop the migration to SB until the affinity issue is fixed in our code.
Pavel
@perfwise
What does "DSB" stand for?
Sergey Kostrov wrote:
>>@perfwise
>>
>>What does "DSB" stand for? When somebody uses an abbreviation, like DSB, and doesn't explain what it means, it is the most unpleasant thing on any forum. Personally, I don't have time to "hunt down" all these unexplained abbreviations on the Internet if I don't know what they mean.
Yes, I completely agree with you.
DSB is short for Decode Stream Buffer, which is the uop cache. Probably Googleable. I asked that question on this forum, and there is a topic I started here with that and many more acronyms which Intel uses.
>>If you're at 3+ in IPC.. and you're not hitting in the DSB..
This is obvious, but SB's sustained rate should be 4 uops per cycle, and even AVX instructions are decoded into single uops. So when you are dealing with code which uses a lot of SSE/AVX instructions that need <3 uops for execution, there should not be any starvation.
Excuse me, what is obvious to you?
You don't get uops from the ILD if you can't fetch enough bytes to decode; that seems to make sense to me. This was my original inquiry: how big are the instructions, and what's the IPC? If you haven't run many tests to isolate the theoretical limits of the ILD or DSB on your chips, then don't treat anything as obvious. It is unpleasant, especially for someone like me who has run those tests and knows the differences in capability between the ILD on SB and IB, to have you state this is obvious.
I've determined in my workloads, with my customers, and through the SPEC suite what the sources of uops are (ILD, DSB, or MS) on SB and IB. If you've not done that, or you haven't written directed tests to study these issues and collected performance-counter data, please refrain from stating something is obvious.
Perfwise
What is obvious to me is this sentence: "If you're at 3+ in IPC.. and you're not hitting in the DSB.. you may degrade performance". You probably misunderstood my post.
And it is clear that there is a direct dependency on the front end's x86 instruction-fetch bandwidth.