Hi all,
After upgrading servers from dual Xeon E5645 2.4GHz (Nehalem) to dual Xeon E5-2620 2.0GHz (Sandy Bridge) I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that summarizes the problem. In general I have a prebuilt LUT with 3000 int rows, each row containing about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I tried it once in the main thread and once in a separate thread (with the main thread waiting). I know there is thread creation overhead, but I used to think it is up to 1 ms. For precise results I average over 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64; my application is also x64. The code was compiled with VC++ 2012 Express. The results are:
Dual Xeon E5645 2.4GHz (Nehalem): Main thread - 340.522[ms], Separate thread: 388.598[ms] Diff: 13%
Dual Xeon E5-2620 2.0GHz (Sandy Bridge): Main thread - 362.515[ms], Separate thread: 565.295[ms] Diff: 36%
Single Xeon E3-1230 V2 3.3GHz (Ivy Bridge): Main thread - 234.928[ms], Separate thread: 267.603[ms] Diff: 13%
My problem is with the 36%. Can anyone explain what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy Bridge?
Many thanks, Pavel.
That would be nice.
>>>You will need some C/C++ compiler that has thread header file. So far I see the one only in Visual Studio 2012. >>>
Thanks for informing me about this. I completely failed to take it into account.
>>>Hello Pavel,
Have you tried adding timing statements inside the Run() routine? This would tell us how much of the runtime variation is due to thread creation overhead versus how much time is spent actually doing the work in the loop>>>
Hi Patrick!
Xperf has some thread creation and context switching timing and monitoring abilities. By default it is system-wide, but I think there is a possibility to launch the monitored process directly with xperf. Or it could be done programmatically.
So this is really simple.
Change the Run routine from:
void Run()
{
    for (int i=0; i<m_data.size(); i++)
    {
        vector<int> &row = m_data[i];
        copy(row.begin(), row.end(), m_buffer.begin());
        sort(m_buffer.begin(), m_buffer.end());
    }
}
to something like:
void Run()
{
    QueryPerformanceCounter(&start2);
    for (int i=0; i<m_data.size(); i++)
    {
        vector<int> &row = m_data[i];
        copy(row.begin(), row.end(), m_buffer.begin());
        sort(m_buffer.begin(), m_buffer.end());
    }
    QueryPerformanceCounter(&finish2);
    timeMs2 += (double)(finish2.QuadPart - start2.QuadPart); // divide by QueryPerformanceFrequency to get seconds
}
where timeMs2 is a global variable.
Then you can compare the time inside Run() with the time outside Run() and see if (as I expect) the time spent inside the Run() code is exactly the same for the 2 (-fast and not -fast) cases.
No need to mess with xperf or anything complicated yet.
Pat
>>>No need to mess with xperf or anything complicated yet.>>>
Hi Pat!
QueryPerformanceCounter used exactly as in your code snippet will not provide any timing information about the time spent in the thread creation routines. One of the original poster's questions was how to measure the latency (overhead) of the thread creation routines.
The code with my suggested changes is basically:
start_timer1
create thread (or not)
start_timer2
do_work_in_loop
end_timer2
end thread (if created)
end_timer1
If you create a thread, the difference between timer1 and timer2 should be the overhead of creating the thread.
And the 2 timers would verify that the same amount of time is spent in 'do_work_in_loop'. If the time is not the same, then something unexpected (but not unprecedented) is going on.
Hi all,
The performance problem may be resolved by an affinity mask - I tried it and it worked. I have two sockets in the server, and most probably the problem is in the L3 cache. However, I can't explain why Sandy Bridge behaves worse than Nehalem - there must be an Intel bug.
Regards, Pavel
If an affinity mask fixes the performance and you have 2 sockets, then perhaps the Sandy Bridge system has NUMA enabled and for some reason the new thread runs on the other socket? This would cause the Sandy Bridge system with the new thread to do remote memory accesses, whereas the single-threaded version does local memory accesses.
Do you have NUMA enabled on the Sandy Bridge?
Do you have NUMA enabled on the Nehalem box?
Hi Patrick,
What is NUMA, and where can it be enabled? In the BIOS?
Thanks, Pavel
>>>then perhaps the sandy bridge system has NUMA enabled and for some reason the new thread runs on the other socket? >>>
Yes, it could also be a NUMA-related issue.
>>>What is NUMA and where is may be enabled?>>>
@Pavel
Here is a very interesting discussion about NUMA performance: http://software.intel.com/en-us/forums/topic/346334
Pavel has a dual Xeon motherboard, so it is a NUMA system.
If both systems are NUMA and I am not using affinity masks in my code, how can one system run faster than the other?
How can I check whether NUMA is enabled? In the BIOS? Can I check it from Windows with some program?
Thanks, Pavel
>>>How can I check if NUMA is enabled? In BIOS? Can I check it from Windows with some program?>>>
You can check for NUMA nodes programmatically. Please consult this reference: msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx
UPD: Both servers are NUMA enabled.
Can the cost of remote memory use be so much higher on Sandy Bridge?
@Pavel
It is not an easy question to answer. There is also very scarce information about NUMA in the Intel SDM. I'm posting a link to a very interesting discussion about NUMA-related performance; I posted a few links there, one of which gives a detailed explanation of NUMA performance degradation.
Link to the post: http://software.intel.com/en-us/forums/topic/346334
Very interesting information regarding NUMA performance degradation: http://communities.vmware.com/thread/391284
@Pavel
I posted a few links to a very interesting discussion also related to NUMA and performance degradation. Unfortunately my posts are still queued for admin approval, so I'm posting below a part of my answer from that discussion.
>>>Probably NUMA architecture-related memory distances, coupled with the thread being executed by different nodes and forced to access its non-local memory, could be responsible for any performance degradation related to memory accesses. When the number of nodes is greater than 1, some performance penalty is to be expected. IIRC the penalty is measured in units of "NUMA distance", with a normalized value of 10: every access to local memory has a cost of 10 (normalized), i.e. 1.0. When the process accesses off-node (remote) memory, from the NUMA point of view some penalty is added because of the overhead of moving data over the NUMA interlink. Accessing a neighbouring node can add up to 0.4, so the total penalty can reach 1.4. More information can be found in the ACPI documentation.>>>