After upgrading servers from dual Xeon E5645 2.4GHz (Nehalem) to dual Xeon E5-2620 2.0GHz (Sandy Bridge) I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that reproduces the problem. In general, I have a prebuilt LUT with 3000 int rows, each row containing about 2000 numbers. The function just copies each row into a preallocated buffer and sorts it. I run it once in the main thread and once in a separate thread (with the main thread waiting). I do know that there is thread-creation overhead, but I used to think it was up to 1ms. For precise results I am averaging over 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64, and my application is also x64. The code was compiled with VC++ 2012 Express. The results are:
Dual Xeon E5645 2.4GHz (Nehalem): Main thread: 340.522[ms], Separate thread: 388.598[ms], Diff: 13%
Dual Xeon E5-2620 2.0GHz (Sandy Bridge): Main thread: 362.515[ms], Separate thread: 565.295[ms], Diff: 36%
Single Xeon E3-1230 V2 3.3GHz (Ivy Bridge): Main thread: 234.928[ms], Separate thread: 267.603[ms], Diff: 13%
My problem is with the 36%. Can anyone explain to me what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy Bridge?
Many thanks, Pavel.
I think that profiling your program with Xperf should be done first. The main idea is to check how much time is spent in the thread-creation stage and in the CS (context switch) stage. Please install Xperf, or run it if you have it installed already. Next, start your application. Below are the commands to be entered from an elevated command prompt:
xperf.exe -on -stackwalk PROC_THREAD+CSWITCH
xperf.exe -stop "name of your file".etl
I have forgotten to add that you need to disable paging of the executive on Win7 64-bit (this is required for 64-bit stack walking). Use this command:
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
Thanks for your offer; I hope to resolve the problem before the weekend, but who knows. On the above-mentioned site I found only the L3 cache size. The cache sizes are: Xeon E5645 - 12MB (shared between 6 cores), Xeon E5-2620 - 15MB (shared between 6 cores), Xeon E3-1230 V2 - 8MB (shared between 4 cores).
I don't have VS2012+ installed, so I don't have the <thread> header... so I can't build your example.
Have you tried adding timing statements just inside the Run() routine? It seems like this would tell you whether the work itself is running slower, or whether the overhead of creating a thread is just much higher in the Sandy Bridge case than in the other cases.
It is true that the whole data set is larger than the L3 cache; however, there is no race, as only one thread is running and the other is suspended (join). Besides, I am not claiming my implementation is super optimized or that it considers cache sizes; I just need to understand why there is a difference between the servers.
Besides running xperf you can also profile your code with VTune, as Sergey suggested. If you need a precise percentage of the time spent in thread-creation procedures and context-switching procedures, it is advisable to use xperf.
>>>but I think it could be done in a different way with API from thread header>>>
This simply means adding another layer of indirection above the Win API. Would it not be a better option to call the thread-scheduling API directly from his code?
I noticed that changing thread t(&CorticaTask::Run, task) to thread t(&CorticaTask::Run, &task) makes things run significantly faster (on Sandy Bridge), which is understandable; however, it is still very strange that it runs slower at some working point on a better, newer server.