Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Threads overhead Nehalem vs Sandy-bridge vs Ivy-bridge

Pavel_Kogan
Beginner
1,751 Views

Hi all,

After upgrading our servers from dual Xeon E5645 2.4GHz (Nehalem) to dual Xeon E5-2620 2.0GHz (Sandy Bridge), I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that summarizes the problem. In short, I have a prebuilt LUT with 3000 int rows, each row containing about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I ran it once in the main thread and once in a separate thread (while the main thread waits). I know there is thread creation overhead, but I thought it was at most 1 ms. For precise results I average over 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64; my application is also x64. The code was compiled with VC++ 2012 Express. The results are:

Dual Xeon E5645 2.4GHz (Nehalem): Main thread - 340.522[ms], Separate thread: 388.598[ms]  Diff: 13%

Dual Xeon E5-2620 2.0GHz (Sandy bridge): Main thread - 362.515[ms], Separate thread: 565.295[ms]  Diff: 36%

Single Xeon E3-1230 V2 3.3GHz (Ivy bridge): Main thread - 234.928[ms], Separate thread: 267.603[ms]  Diff: 13%

My problem is with the 36%. Can anyone explain to me what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy Bridge?

Many thanks, Pavel.
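
Since the attached sample is not reproduced in the thread, here is a minimal sketch of the test as described above, assuming C++11 `std::thread` and `std::chrono`. The names `build_lut`, `copy_and_sort`, `average_ms` and `run_benchmark` are hypothetical, the random LUT contents are an assumption, and so is creating a fresh thread on every iteration; this is not the original attachment.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <random>
#include <thread>
#include <vector>

typedef std::vector<std::vector<int> > Lut;

// Build a LUT of `rows` rows with `cols` pseudo-random ints each.
// The contents are an assumption; the post only gives the sizes.
Lut build_lut(int rows, int cols) {
    std::mt19937 rng(12345);  // fixed seed so every run sorts the same data
    Lut lut(rows, std::vector<int>(cols));
    for (auto& row : lut)
        for (auto& value : row) value = static_cast<int>(rng() >> 1);
    return lut;
}

// Copy each row into the preallocated buffer and sort it there.
void copy_and_sort(const Lut& lut, std::vector<int>& buffer) {
    for (const auto& row : lut) {
        assert(buffer.size() >= row.size());
        std::copy(row.begin(), row.end(), buffer.begin());
        std::sort(buffer.begin(), buffer.begin() + row.size());
    }
}

// Average wall-clock milliseconds per call of `work`.
template <typename Work>
double average_ms(Work work, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) work();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}

struct Timings {
    double main_ms;    // work done directly in the calling thread
    double thread_ms;  // work done in a new thread while the caller waits
};

Timings run_benchmark(int rows, int cols, int iterations) {
    const Lut lut = build_lut(rows, cols);
    std::vector<int> buffer(cols);
    Timings t;
    t.main_ms = average_ms([&] { copy_and_sort(lut, buffer); }, iterations);
    t.thread_ms = average_ms([&] {
        std::thread worker([&] { copy_and_sort(lut, buffer); });
        worker.join();  // the calling thread only waits
    }, iterations);
    return t;
}
```

Calling `run_benchmark(3000, 2000, 100)` from `main` and printing both fields would correspond to the setup in the post. The absolute numbers will of course differ from the ones reported here.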

55 Replies
Bernard
Valued Contributor I

@Pavel

Can you add NUMA API functions to your test case and test it on both servers?
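
As a rough sketch of what that could look like, the helpers below query the per-node processor masks with the Win32 NUMA API and pin the calling thread to one node. The function names `numa_node_masks` and `pin_current_thread` are hypothetical. The sketch assumes at most 64 logical processors (a single processor group), and on non-Windows builds it compiles to a stub that reports no NUMA information.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// One entry per NUMA node: bit i set means logical processor i is on that node.
// Returns an empty vector when no NUMA information is available.
std::vector<std::uint64_t> numa_node_masks() {
    std::vector<std::uint64_t> masks;
#ifdef _WIN32
    ULONG highest = 0;
    if (GetNumaHighestNodeNumber(&highest)) {
        for (ULONG node = 0; node <= highest; ++node) {
            ULONGLONG mask = 0;
            if (GetNumaNodeProcessorMask(static_cast<UCHAR>(node), &mask) && mask != 0)
                masks.push_back(mask);
        }
    }
#endif
    return masks;
}

// Restrict the calling thread to the processors in `mask`.
// Returns false on failure, and always false on non-Windows builds.
bool pin_current_thread(std::uint64_t mask) {
    assert(mask != 0);
#ifdef _WIN32
    return SetThreadAffinityMask(GetCurrentThread(),
                                 static_cast<DWORD_PTR>(mask)) != 0;
#else
    (void)mask;  // stub: affinity is not changed
    return false;
#endif
}
```

One way to use it in the test case: call `pin_current_thread(masks[0])` at the start of both the main-thread run and the worker thread, so that both runs execute on the same node as the LUT memory, then compare the timings again.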

perfwise
Beginner

Just an inquiry... but how many threads are you running? Are you running with hyperthreading? What's the IPC of the threads, and what's the average number of bytes per instruction in each thread? I wonder whether you're running out of ILD bandwidth on SB. SB is more prone to this than IB. Just a thought...

perfwise

Bernard
Valued Contributor I

@perfwise

Can pressure building up on one of the execution ports trigger rescheduling of the threads and move them further away in the NUMA space?

Bernard
Valued Contributor I

NUMA-related information is also located in the KPRCB structure.

Bernard
Valued Contributor I

@Pavel

Regarding setting processor affinity, you can use the so-called "Interrupt Affinity Policy Tool". You can download it from the Microsoft website. Bear in mind that those settings are related to interrupt priority and can be used only when it is known that some driver's ISR is consuming too much processor resources.

perfwise
Beginner

iliyapolak,

SB differs from IB because, if your IPC is high enough and you're not hitting in the DSB, you can become starved for instructions from the front end. That's one of the biggest differences between SB and NH/IB. One can identify whether this is an issue by monitoring the number of uops delivered by the DSB and also the IPC. If you're at 3+ in IPC.. and you're not hitting in the DSB.. you may degrade performance. My experience tells me that if you're trying to pull more than 8 bytes per cycle from the ILD, you're not going to be a happy camper on SB, but IB can do so. There must have been some issue that shipped with SB which, while not a functional problem, degraded performance and was fixed in IB. This is a hard issue to identify, and I'm sure most don't know it exists, simply because it's likely rare given the large percentage of time the DSB is delivering uops.

perfwise
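
To make the check described above concrete, here is a small sketch that turns raw counter readings into IPC, DSB share and fetch demand. The struct and function names are hypothetical. The event names in the comments (`INST_RETIRED.ANY`, `CPU_CLK_UNHALTED.THREAD`, `IDQ.DSB_UOPS`, `IDQ.MITE_UOPS`, `IDQ.MS_UOPS`) are the ones I would expect to collect on SB, and the 0.5 cutoff for "not hitting in the DSB" is purely an assumption to be tuned.

```cpp
#include <cassert>
#include <cstdint>

// Raw counts collected from the PMU over one measurement interval.
struct FrontEndCounters {
    std::uint64_t instructions;  // e.g. INST_RETIRED.ANY
    std::uint64_t cycles;        // e.g. CPU_CLK_UNHALTED.THREAD
    std::uint64_t dsb_uops;      // e.g. IDQ.DSB_UOPS (uop cache)
    std::uint64_t mite_uops;     // e.g. IDQ.MITE_UOPS (legacy decode path behind the ILD)
    std::uint64_t ms_uops;       // e.g. IDQ.MS_UOPS (microcode sequencer)
};

// Instructions retired per cycle.
double ipc(const FrontEndCounters& c) {
    assert(c.cycles > 0);
    return static_cast<double>(c.instructions) / static_cast<double>(c.cycles);
}

// Fraction of all delivered uops that came from the DSB.
double dsb_share(const FrontEndCounters& c) {
    const std::uint64_t total = c.dsb_uops + c.mite_uops + c.ms_uops;
    assert(total > 0);
    return static_cast<double>(c.dsb_uops) / static_cast<double>(total);
}

// Instruction bytes consumed per cycle, given the average instruction size.
double fetch_bytes_per_cycle(const FrontEndCounters& c, double avg_instruction_bytes) {
    return ipc(c) * avg_instruction_bytes;
}

// True when IPC is 3+ and the DSB share is below the (assumed) cutoff.
bool ild_bound_suspect(const FrontEndCounters& c, double min_dsb_share = 0.5) {
    return ipc(c) >= 3.0 && dsb_share(c) < min_dsb_share;
}
```

If `ild_bound_suspect` returns true and `fetch_bytes_per_cycle` is above 8, the thread fits the pattern described above and is worth comparing between SB and IB.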

Pavel_Kogan
Beginner

Thank you all for the very professional and useful feedback. We decided to stop the migration to SB until the affinity issue is fixed in our code.

Pavel 

Bernard
Valued Contributor I

@perfwise

What does "DSB" stand for?

SergeyKostrov
Valued Contributor II

>>@perfwise
>>
>>What does "DSB" stand for?

When somebody uses an abbreviation like DSB and doesn't explain what it means, it is the most unpleasant thing on any forum. Personally, I don't have time to hunt down on the Internet all these unexplained abbreviations if I don't know what they mean.

Bernard
Valued Contributor I

Sergey Kostrov wrote:

>>@perfwise
>>
>>What does "DSB" stand for?

When somebody uses an abbreviation like DSB and doesn't explain what it means, it is the most unpleasant thing on any forum. Personally, I don't have time to hunt down on the Internet all these unexplained abbreviations if I don't know what they mean.

Yes I completely agree with you.

perfwise
Beginner

DSB is short for decode stream buffer, which is the uop cache. Probably googleable. I asked that question on this forum, and there is a topic I started here covering that and many more acronyms that Intel uses.

Bernard
Valued Contributor I

>>> If you're at 3+ in IPC.. and you're not hitting in the DSB>>>

This is obvious, but SB's sustained rate should be 4 uops per cycle, and even AVX instructions are decoded into single uops. So when you are dealing with code that uses a lot of SSE/AVX instructions which need fewer than 3 uops to execute, there should not be any starvation.

perfwise
Beginner

Excuse me, what is obvious to you?

You don't have uops from the ILD if you can't fetch enough bytes to decode; that seems to make sense to me. This was my original inquiry: how big are the instructions and what's the IPC? If you haven't run many tests to isolate the theoretical limits of the ILD or DSB on your chips, then don't treat anything as obvious. It is unpleasant, especially for someone like me who has run those tests and knows the differences in capability between the ILD on SB and IB, to have you state that this is obvious.

I've determined, in my workloads, my customers' workloads and the SPEC suite, what the sources of uops (ILD, DSB or MS) are on SB and IB. If you've not done that, or you haven't written directed tests to study these issues and collected performance counter data, please refrain from stating that something is obvious.

Perfwise

Bernard
Valued Contributor I

This sentence is obvious to me: "If you're at 3+ in IPC.. and you're not hitting in the DSB.. you may degrade performance". You probably misunderstood my post.

And it is clear that there is a direct dependency on the front end's x86 instruction fetch bandwidth.
