Beginner

Thread overhead: Nehalem vs. Sandy Bridge vs. Ivy Bridge

Hi all,

After upgrading servers from dual Xeon E5645 2.4 GHz (Nehalem) to dual Xeon E5-2620 2.0 GHz (Sandy Bridge) I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that summarizes the problem. In general, I have a prebuilt LUT with 3000 int rows, each row containing about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I tried it once in the main thread and once in a separate thread (with the main thread waiting). I do know that there is thread-creation overhead, but I used to think it is up to 1 ms. For precise results I am averaging 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64, and my application is also x64. The code was compiled with VC++ 2012 Express. The results are:

Dual Xeon E5645 2.4 GHz (Nehalem): Main thread: 340.522 ms, Separate thread: 388.598 ms, Diff: 13%

Dual Xeon E5-2620 2.0 GHz (Sandy Bridge): Main thread: 362.515 ms, Separate thread: 565.295 ms, Diff: 36%

Single Xeon E3-1230 V2 3.3 GHz (Ivy Bridge): Main thread: 234.928 ms, Separate thread: 267.603 ms, Diff: 13%

My problem is with the 36%. Can anyone explain to me what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy Bridge?

Many thanks, Pavel.
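
(Since the attachment is not visible in this transcript, here is a condensed sketch of what the described test amounts to. All names are illustrative, and unlike the real sample it times one iteration instead of averaging 100.)

#include <windows.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

using namespace std;

vector< vector<int> > g_data;   // ~3000 rows of ~2000 ints (the prebuilt LUT)
vector<int> g_buffer;           // preallocated scratch buffer

void Run()
{
    for (size_t i = 0; i < g_data.size(); i++)
    {
        copy(g_data[i].begin(), g_data[i].end(), g_buffer.begin());
        sort(g_buffer.begin(), g_buffer.end());
    }
}

static double ElapsedMs(const LARGE_INTEGER &t0, const LARGE_INTEGER &t1)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    return (t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart;
}

int main()
{
    g_data.assign(3000, vector<int>(2000));
    for (size_t r = 0; r < g_data.size(); r++)
        for (size_t c = 0; c < g_data[r].size(); c++)
            g_data[r][c] = rand();
    g_buffer.resize(2000);

    LARGE_INTEGER t0, t1;

    QueryPerformanceCounter(&t0);
    Run();                          // case 1: work in the main thread
    QueryPerformanceCounter(&t1);
    printf("Main thread: %.3f ms\n", ElapsedMs(t0, t1));

    QueryPerformanceCounter(&t0);
    thread t(Run);                  // case 2: work in a separate thread
    t.join();                       // main thread waits
    QueryPerformanceCounter(&t1);
    printf("Separate thread: %.3f ms\n", ElapsedMs(t0, t1));
    return 0;
}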

Black Belt

>>>My Dell Precision M4700 with Windows 7 Professional 64-bit OS is highly optimized for different performance evaluations. It means that I turned off as many Windows Services as possible, and when the computer is not connected to the network (I simply disable the network card) only 33 Windows Services are running>>>

Disabling the network adapter is a wise decision, because servicing network-card interrupts and the subsequent packet processing can hog the CPU. I would also recommend running general system monitoring from time to time; with the help of the Xperf tool you will get a very detailed breakdown of various activity. Moreover, it is recommended to disable your AV software when you are not connected to the Internet. It is known that, for example, Kaspersky AV uses system-wide hooks and detours to check the callers of system functions, and this activity can add to the load on the CPU. AV software also often installs custom drivers to gain access to various internal OS structures in the kernel, and this work is sometimes done at IRQL == DISPATCH_LEVEL (the DPC level), mostly for synchronization; it can block the scheduler, which also runs at DISPATCH_LEVEL, so uninstalling the AV on a developer's machine is highly recommended.

Black Belt

That would be nice.

>>>You will need some C/C++ compiler that has the <thread> header file. So far I see it only in Visual Studio 2012.>>>

Thanks for informing me about this. I completely failed to take it into account.

Black Belt

>>>Hello Pavel,

Have you tried adding timing statements inside the Run() routine? This would tell us how much of the runtime variation is due to thread creation overhead versus how much time is spent actually doing the work in the loop>>>

Hi Patrick!

Xperf has some thread-creation and context-switch timing and monitoring abilities. By default it is system-wide, but I think there is a possibility to launch the monitored process directly using xperf. Or it could be done programmatically.

Valued Contributor II

Pavel, here is another piece of advice. Add a call to the _getch() CRT function at the very beginning of main(), like: ...main(...) { _getch(); ... } While the test application waits for input from the keyboard, open Windows Task Manager and select the test application. Then "force" execution of the test on just one CPU (use the Set Affinity item from the popup menu). Also, take a look at how many threads are created when the test continues execution.
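
(A minimal sketch of that pause, assuming the Microsoft CRT's _getch() from <conio.h>; the test itself is elided.)

#include <conio.h>

int main(int argc, char* argv[])
{
    // Wait for a key press so the process can be selected in Task Manager
    // and pinned to one CPU via "Set Affinity" before the test starts.
    _getch();

    // ... run the test ...
    return 0;
}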

So this is really simple. 

Change the Run routine from:

void Run()
{
    for (size_t i = 0; i < m_data.size(); i++)
    {
        const vector<int> &row = m_data[i];
        copy(row.begin(), row.end(), m_buffer.begin());
        sort(m_buffer.begin(), m_buffer.end());
    }
}

to something like:

void Run()
{
    QueryPerformanceCounter(&start2);
    for (size_t i = 0; i < m_data.size(); i++)
    {
        const vector<int> &row = m_data[i];
        copy(row.begin(), row.end(), m_buffer.begin());
        sort(m_buffer.begin(), m_buffer.end());
    }
    QueryPerformanceCounter(&finish2);
    // accumulate raw QPC ticks for the loop only (thread creation excluded)
    timeMs2 += (double)(finish2.QuadPart - start2.QuadPart);
}

where timeMs2 is a global variable (and start2/finish2 are LARGE_INTEGERs). Note that it accumulates raw ticks, not milliseconds; divide by the frequency from QueryPerformanceFrequency to convert.
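
(For reference, a sketch of the final conversion, assuming the 100-iteration averaging from the original post.)

LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);   // ticks per second
double avgMs = timeMs2 * 1000.0 / (double)freq.QuadPart / 100.0;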

Then you can compare the time inside Run() with the time outside Run() and see whether (as I expect) the time spent inside the Run() code is exactly the same for the two cases (-fast and not -fast).

No need to mess with xperf or anything complicated yet.

Pat

Valued Contributor II

Thanks, Patrick! I'll run another set of tests on my Ivy Bridge and post the results by Monday.
Black Belt

>>>No need to mess with xperf or anything complicated yet.>>>

Hi Pat!

QueryPerformanceCounter used exactly as in your code snippet will not provide any timing information about the time spent in the thread-creation routines. One of the thread starter's initial questions was how to measure the latency (overhead) of thread creation.


The code with my suggested changes is basically:


start_timer1
create thread (or not)
start_timer2
do_work_in_loop
end_timer2
end thread (if created)
end_timer1

If you create a thread, it seems like the difference between timer1 and timer2 should be the overhead of creating the thread.

And the two timers would verify that the same amount of time is spent in 'do_work_in_loop'. If the time is not the same, then something unexpected (but not unprecedented) is going on.
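
(A minimal self-contained sketch of that two-timer scheme; DoWorkInLoop() here is just a stand-in for the real copy-and-sort loop, and std::thread assumes VS2012 or later.)

#include <windows.h>
#include <cstdio>
#include <thread>

double g_workTicks = 0.0;   // inner timer (timer2), in raw QPC ticks

void DoWorkInLoop()
{
    // stand-in for the copy+sort loop of the test case
    volatile long sink = 0;
    for (long i = 0; i < 100000000L; i++) sink += i;
}

void TimedWorker()
{
    LARGE_INTEGER start2, finish2;
    QueryPerformanceCounter(&start2);    // start_timer2
    DoWorkInLoop();
    QueryPerformanceCounter(&finish2);   // end_timer2
    g_workTicks = (double)(finish2.QuadPart - start2.QuadPart);
}

int main()
{
    LARGE_INTEGER start1, finish1, freq;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start1);    // start_timer1
    std::thread t(TimedWorker);          // create thread (or call TimedWorker() inline)
    t.join();                            // end thread
    QueryPerformanceCounter(&finish1);   // end_timer1

    double totalMs = (finish1.QuadPart - start1.QuadPart) * 1000.0 / (double)freq.QuadPart;
    double workMs  = g_workTicks * 1000.0 / (double)freq.QuadPart;
    printf("total %.3f ms, work %.3f ms, overhead %.3f ms\n", totalMs, workMs, totalMs - workMs);
    return 0;
}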

Beginner

Hi all,

The performance problem can be resolved with an affinity mask - I tried it and it worked. I have two sockets in the server, and most probably the problem is in the L3 cache. However, I can't explain why Sandy Bridge behaves worse than Nehalem - there must be an Intel bug.

Regards, Pavel
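
(For readers following along: a sketch of forcing the process onto one CPU programmatically. The exact mask Pavel used is not shown in the thread; mask 1 below means logical CPU 0 only.)

#include <windows.h>
#include <cstdio>

int main()
{
    // Restrict the whole process to logical CPU 0 so the worker thread runs
    // on the same socket (and NUMA node) as the thread that built the data.
    if (!SetProcessAffinityMask(GetCurrentProcess(), 1))
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());

    // ... run the benchmark ...
    return 0;
}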


If the affinity mask fixes the performance and you have 2 sockets, then perhaps the Sandy Bridge system has NUMA enabled and for some reason the new thread runs on the other socket. This would cause the Sandy Bridge system with the new thread to do remote memory accesses, whereas the single-threaded version does local memory accesses.

Do you have NUMA enabled on the Sandy Bridge?

Do you have NUMA enabled on the Nehalem box?

Beginner

Hi Patrick,

What is NUMA and where can it be enabled? In the BIOS?

Thanks, Pavel

Valued Contributor II

Thanks for the update, Pavel. When I tested your code I saw that 4 threads were created, one thread per CPU, and I expect (sorry, I didn't have time to investigate with VTune) that they were "fighting" for access to the data, but overall the test with the '-fast' switch worked faster on my Ivy Bridge.

>>...in the server, and most probably the problem is in the L3 cache. However, I can't explain why Sandy Bridge behaves worse than Nehalem - there must be an Intel bug.

Of course it is possible, but it needs to be proven. Please provide more details and a new test case if you think so.

Best regards, Sergey
Black Belt

>>>then perhaps the Sandy Bridge system has NUMA enabled and for some reason the new thread runs on the other socket?>>>

Yes, it could also be a NUMA-related issue.

>>>What is NUMA and where can it be enabled?>>>

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access

Black Belt

@Pavel

Here is a very interesting discussion about NUMA performance: http://software.intel.com/en-us/forums/topic/346334

Valued Contributor II

Pavel, could you check the specs of your hardware to confirm that you have NUMA system(s)? Thanks in advance.
Black Belt

Pavel has a dual-Xeon motherboard, so it is a NUMA system.

Beginner

If both systems are NUMA and I am not using affinity masks in my code, how can one system run faster than the other?

How can I check whether NUMA is enabled? In the BIOS? Can I check it from Windows with some program?

Thanks, Pavel

Black Belt

>>>How can I check whether NUMA is enabled? In the BIOS? Can I check it from Windows with some program?>>>

You can check for NUMA nodes programmatically. Please consult this reference: msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx
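
For example, a minimal sketch using the Win32 call GetNumaHighestNodeNumber:

#include <windows.h>
#include <cstdio>

int main()
{
    ULONG highestNode = 0;
    if (GetNumaHighestNodeNumber(&highestNode))
    {
        // On a two-socket NUMA system this typically prints 1 (nodes 0 and 1);
        // 0 means the OS sees only a single memory node.
        printf("Highest NUMA node number: %lu\n", highestNode);
    }
    return 0;
}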

Beginner

Update: Both servers are NUMA-enabled.

Beginner

Can the cost of remote memory access be so much higher on Sandy Bridge?

0 Kudos