Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Thread overhead: Nehalem vs Sandy Bridge vs Ivy Bridge

Pavel_Kogan
Beginner
1,747 Views

Hi all,

After upgrading servers from Dual Xeon E5645 2.4GHz (Nehalem) to Dual Xeon E5-2620 2.0GHz (Sandy Bridge) I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that summarizes the problem. In general, I have a prebuilt LUT with 3000 int rows, each row containing about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I tried it once in the main thread and once in a separate thread (with the main thread waiting). I know there is thread creation overhead, but I used to think it was at most 1 ms. For precise results I am averaging 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64; my application is also x64. The code was compiled with VC++ 2012 Express. The results are:

Dual Xeon E5645 2.4GHz (Nehalem): Main thread - 340.522 [ms], Separate thread - 388.598 [ms], Diff: 13%

Dual Xeon E5-2620 2.0GHz (Sandy Bridge): Main thread - 362.515 [ms], Separate thread - 565.295 [ms], Diff: 36%

Single Xeon E3-1230 V2 3.3GHz (Ivy Bridge): Main thread - 234.928 [ms], Separate thread - 267.603 [ms], Diff: 13%

My problem is with the 36%. Can anyone explain to me what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy Bridge?

Many thanks, Pavel.

0 Kudos
55 Replies
Bernard
Valued Contributor I
445 Views

That would be nice.

>>>You will need some C/C++ compiler that has the <thread> header file. So far I see it only in Visual Studio 2012.>>>

Thanks for informing me about this. I had completely overlooked it.

0 Kudos
Bernard
Valued Contributor I
445 Views

>>>Hello Pavel,

Have you tried adding timing statements inside the Run() routine? This would tell us how much of the runtime variation is due to thread creation overhead versus how much time is spent actually doing the work in the loop.>>>

Hi Patrick!

Xperf has some thread creation and context switching timing and monitoring abilities. By default it works system-wide, but I think there is a possibility to launch the monitored process directly using xperf, or it could be done programmatically.

0 Kudos
SergeyKostrov
Valued Contributor II
445 Views
Pavel, Here is another piece of advice: Add a call to the getch() CRT function at the very beginning of the main function, like:

...main(...)
{
    getch();
    ...
}

While the test application waits for input from the keyboard, open Windows Task Manager and select the test application. Then, "force" execution of the test on just one CPU (use the Set Affinity item from the popup menu). Also, take a look at how many threads are created when the test continues execution.
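A minimal, self-contained sketch of that pause-then-inspect idea (my illustration, not Sergey's exact code; it assumes Windows and the Microsoft CRT, where the function is spelled _getch in <conio.h>):

#include <conio.h>   // _getch(): blocks until a single key press
#include <cstdio>

int main()
{
    std::printf("Set the affinity from Task Manager now, then press any key...\n");
    _getch();        // pause here so Set Affinity can be applied to the process

    // ... run the actual benchmark after the key press ...
    return 0;
}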
0 Kudos
Patrick_F_Intel1
Employee
445 Views

So this is really simple. 

Change the Run routine from:

void Run()
{
    for (size_t i = 0; i < m_data.size(); i++)
    {
        vector<int> &row = m_data[i];                       // row i of the LUT
        copy(row.begin(), row.end(), m_buffer.begin());     // copy row into scratch buffer
        sort(m_buffer.begin(), m_buffer.end());             // sort the copy
    }
}

to something like:

void Run()
{
    QueryPerformanceCounter(&start2);
    for (size_t i = 0; i < m_data.size(); i++)
    {
        vector<int> &row = m_data[i];
        copy(row.begin(), row.end(), m_buffer.begin());
        sort(m_buffer.begin(), m_buffer.end());
    }
    QueryPerformanceCounter(&finish2);
    timeMs2 += (double)(finish2.QuadPart - start2.QuadPart);
}

where timeMs2 is a global variable.
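One note on units (my addition, not part of Pat's snippet): QueryPerformanceCounter returns raw ticks, so to report milliseconds the accumulated difference has to be scaled by the counter frequency, roughly like this:

#include <windows.h>
#include <cstdio>

extern double timeMs2;   // the global that accumulates raw QPC tick deltas in Run()

void ReportTimeInsideRun()
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   // ticks per second, constant at run time
    double elapsedMs = timeMs2 * 1000.0 / (double)freq.QuadPart;
    std::printf("time inside Run(): %.3f ms\n", elapsedMs);
}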

Then you can compare the time inside Run() with the time outside Run() and see if (as I expect) the time spent inside the Run() code is exactly the same for the two cases (-fast and not -fast).

No need to mess with xperf or anything complicated yet.

Pat

 

0 Kudos
SergeyKostrov
Valued Contributor II
445 Views
Thanks, Patrick! I'll run another set of tests on my Ivy Bridge and the results will be posted by Monday.
0 Kudos
Bernard
Valued Contributor I
445 Views

>>>No need to mess with xperf or anything complicated yet.>>>

Hi Pat!

QueryPerformanceCounter used exactly as in your code snippet will not provide any timing information about the time spent in the thread creation routines. One of the initial thread starter's questions was how to measure the latency (overhead) of the thread creation routines.

0 Kudos
Patrick_F_Intel1
Employee
445 Views

The code with my suggested changes is basically:


start_timer1
create thread (or not)
start_timer2
do_work_in_loop
end_timer2
end thread (if created)
end_timer1

If you create a thread, it seems like the difference between timer1 and timer2 should be the overhead of creating the thread.

And the two timers would verify that the same amount of time is spent in 'do_work_in_loop'. If the time is not the same, then something unexpected (but not unprecedented) is going on.
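To make that concrete, here is one way the two-timer scheme could look in C++11 (a sketch of my own, assuming the <thread> header that VC++ 2012 provides; doWork() is a hypothetical stand-in for Pavel's Run() loop):

#include <windows.h>
#include <thread>
#include <cstdio>

static LARGE_INTEGER t2Start, t2End;      // timer2, set inside the worker thread

static void doWork()
{
    // placeholder for Pavel's copy+sort loop
    volatile long sink = 0;
    for (long i = 0; i < 10000000; ++i) sink += i;
}

static void worker()
{
    QueryPerformanceCounter(&t2Start);    // timer2 start: the work only
    doWork();
    QueryPerformanceCounter(&t2End);      // timer2 end
}

int main()
{
    LARGE_INTEGER freq, t1Start, t1End;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t1Start);    // timer1 start: includes create + join
    std::thread t(worker);
    t.join();
    QueryPerformanceCounter(&t1End);      // timer1 end

    double totalMs = (t1End.QuadPart - t1Start.QuadPart) * 1000.0 / freq.QuadPart;
    double workMs  = (t2End.QuadPart - t2Start.QuadPart) * 1000.0 / freq.QuadPart;
    std::printf("total %.3f ms, work %.3f ms, thread overhead ~%.3f ms\n",
                totalMs, workMs, totalMs - workMs);
    return 0;
}

In the single-threaded case you would call worker() directly instead of spawning the thread; the difference between timer1 and timer2 then collapses to nearly zero.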

0 Kudos
Pavel_Kogan
Beginner
445 Views

Hi all,

The performance problem can be resolved with an affinity mask - I tried it and it worked. I have two sockets in the server, and most probably the problem is in the L3 cache. However, I can't explain why Sandy Bridge behaves worse than Nehalem - there must be an Intel bug.
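In code, one way to do it looks roughly like this (a minimal sketch using the Win32 affinity API; the 0x3F mask is only an example for the six cores of one socket, and the right value depends on the machine's processor numbering):

#include <windows.h>

int main()
{
    // Example: bits 0-5 = the six cores of socket 0 on this box (machine-specific!)
    DWORD_PTR mask = 0x3F;
    SetProcessAffinityMask(GetCurrentProcess(), mask);

    // ... create the worker thread and run the benchmark as before ...
    return 0;
}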

Regards, Pavel

0 Kudos
Patrick_F_Intel1
Employee
445 Views

If the affinity mask fixes the performance and you have 2 sockets, then perhaps the Sandy Bridge system has NUMA enabled and for some reason the new thread runs on the other socket. This would cause the Sandy Bridge system with the new thread to do remote memory accesses, whereas the single-threaded version does local memory accesses.

Do you have NUMA enabled on the Sandy Bridge box?

Do you have NUMA enabled on the Nehalem box?

0 Kudos
Pavel_Kogan
Beginner
445 Views

Hi Patrick,

What is NUMA, and where can it be enabled? In the BIOS?

Thanks, Pavel

0 Kudos
SergeyKostrov
Valued Contributor II
445 Views
Thanks for the update, Pavel. When I tested your code I saw that 4 threads were created, one thread per CPU, and I expect (sorry, I didn't have time to investigate with VTune) that they were "fighting" for access to the data, but overall the test with the '--fast' switch worked faster on my Ivy Bridge.

>>... in server and most probably the problem is in L3 cache. However I can't explain why Sandy Bridge behaves worse than
>>Nehalem - there must be an Intel bug.

Of course it is possible, but it needs to be proven. Please provide more details and a new test case if you think so. Best regards, Sergey
0 Kudos
Bernard
Valued Contributor I
445 Views

>>>then perhaps the sandy bridge system has NUMA enabled and for some reason the new thread runs on the other socket? >>>

Yes, it could also be a NUMA-related issue.

>>>What is NUMA, and where can it be enabled?>>>

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access

0 Kudos
Bernard
Valued Contributor I
445 Views

@Pavel

Here is a very interesting discussion about NUMA performance: http://software.intel.com/en-us/forums/topic/346334

0 Kudos
SergeyKostrov
Valued Contributor II
445 Views
Pavel, Could you check the specs of your hardware in order to confirm that you have NUMA system(s)? Thanks in advance.
0 Kudos
Bernard
Valued Contributor I
445 Views

Pavel has a dual-Xeon motherboard, so it is a NUMA system.

0 Kudos
Pavel_Kogan
Beginner
445 Views

If both systems are NUMA and I am not using affinity masks in my code, how can one system run faster than the other?

How can I check if NUMA is enabled? In the BIOS? Can I check it from Windows with some program?

Thanks, Pavel

0 Kudos
Bernard
Valued Contributor I
445 Views

>>>How can I check if NUMA is enabled? In the BIOS? Can I check it from Windows with some program?>>>

You can check for NUMA nodes programmatically. Please consult this reference: msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx
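A minimal sketch of such a check (my illustration; GetNumaHighestNodeNumber is a kernel32 API, and a highest node number above 0 means Windows sees more than one NUMA node):

#include <windows.h>
#include <cstdio>

int main()
{
    ULONG highestNode = 0;
    if (GetNumaHighestNodeNumber(&highestNode))
    {
        // 0 means the OS sees a single node (non-NUMA); 1 on a dual-socket NUMA box
        std::printf("highest NUMA node: %lu -> %s\n", highestNode,
                    highestNode > 0 ? "NUMA system" : "single node");
    }
    return 0;
}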

0 Kudos
Pavel_Kogan
Beginner
445 Views

UPD: Both servers are NUMA enabled.

0 Kudos
Pavel_Kogan
Beginner
445 Views

Can the cost of remote memory access be so much higher on Sandy Bridge?

0 Kudos
Bernard
Valued Contributor I
436 Views

@Pavel

It is not an easy question to answer. There is also very scarce information about NUMA in the Intel SDM. I'm posting a link to a very interesting discussion about NUMA-related performance; I posted a few links there, one of which gives a detailed explanation of NUMA performance degradation.

Link to the post: http://software.intel.com/en-us/forums/topic/346334

Very interesting information regarding NUMA performance degradation: http://communities.vmware.com/thread/391284

0 Kudos
Bernard
Valued Contributor I
436 Views

@Pavel

I posted a few links to a very interesting discussion also related to NUMA and performance degradation. Unfortunately my posts are still queued for admin approval, so I'm posting below a part of my answer from that discussion.

 

>>>Probably NUMA-architecture-related memory distances, coupled with the thread being executed by different nodes and forced to access its non-local memory, could be responsible for any performance degradation related to the memory accesses. When the number of nodes is greater than 1, some performance penalty is to be expected. IIRC the performance penalty is measured in units of "NUMA distance", normalized so that an access to local memory has a cost of 10 (i.e., 1.0). When the process accesses off-node (remote) memory, a penalty is added because of the overhead of moving data over the NUMA interlink. Accessing a neighbouring node can add up to 0.4, so the total penalty can reach 1.4. More information can be found in the ACPI documentation.>>>
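Reading those numbers as a quick sanity check (my own back-of-the-envelope arithmetic, and only a rough upper bound, since the copy+sort loop is not purely memory-bound):

relative remote cost = 14 / 10 = 1.4, i.e. remote accesses up to ~40% slower than local ones

which is at least in the same ballpark as the 36% gap Pavel measured on the dual-socket Sandy Bridge box.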

0 Kudos