Threads overhead Nehalem vs Sandy-bridge vs Ivy-bridge

Pavel_Kogan — Mon, 18 Feb 2013 23:44:33 GMT

Hi all,

After upgrading our servers from dual Xeon E5645 2.4GHz (Nehalem) to dual Xeon E5-2620 2.0GHz (Sandy Bridge), I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that summarizes the problem. In general, I have a prebuilt LUT with 3000 int rows, each row containing about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I ran it once in the main thread and once in a separate thread (with the main thread waiting). I do know that there is thread creation overhead, but I used to think it is up to 1ms. For precise results I am averaging 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64, and my application is also x64. The code was compiled with VC++ 2012 Express. The results are:

Dual Xeon E5645 2.4GHz (Nehalem): Main thread: 340.522[ms], Separate thread: 388.598[ms], Diff: 13%

Dual Xeon E5-2620 2.0GHz (Sandy Bridge): Main thread: 362.515[ms], Separate thread: 565.295[ms], Diff: 36%

Single Xeon E3-1230 V2 3.3GHz (Ivy Bridge): Main thread: 234.928[ms], Separate thread: 267.603[ms], Diff: 13%

My problem is with the 36%. Can anyone explain what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy Bridge?

Many thanks, Pavel.
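The attached sample is not reproduced in this thread, so the following is only a minimal sketch of the benchmark as described in the post: a LUT of int rows, each row copied into a preallocated buffer and sorted, timed once on the calling thread and once on a separate thread while the caller waits in join(). The names `make_lut`, `process` and `average_ms`, the random fill and the checksum are assumptions made for illustration, not Pavel's code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <thread>
#include <vector>

using Lut = std::vector<std::vector<int>>;

// Build a LUT of `rows` rows with `cols` pseudo-random ints each (hypothetical data).
Lut make_lut(std::size_t rows, std::size_t cols, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 1000000);
    Lut lut(rows, std::vector<int>(cols));
    for (auto& row : lut)
        for (auto& v : row) v = dist(gen);
    return lut;
}

// Workload from the post: copy each row into the preallocated buffer and sort it.
// The checksum only exists so that the compiler cannot discard the work.
long long process(const Lut& lut, std::vector<int>& buffer) {
    long long checksum = 0;
    for (const auto& row : lut) {
        if (buffer.size() < row.size()) buffer.resize(row.size());
        auto end = std::copy(row.begin(), row.end(), buffer.begin());
        std::sort(buffer.begin(), end);
        if (!row.empty()) checksum += buffer.front();
    }
    return checksum;
}

// Average wall-clock time per iteration in milliseconds. With separate_thread
// set, each iteration creates a std::thread and the caller blocks in join().
double average_ms(const Lut& lut, std::size_t cols, int iterations, bool separate_thread) {
    assert(iterations > 0);
    std::vector<int> buffer(cols);
    long long sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        if (separate_thread) {
            std::thread t([&] { sink += process(lut, buffer); });
            t.join();
        } else {
            sink += process(lut, buffer);
        }
    }
    auto stop = std::chrono::steady_clock::now();
    volatile long long keep = sink;  // keep the result observable
    (void)keep;
    return std::chrono::duration<double, std::milli>(stop - start).count() / iterations;
}
```

With the sizes from the post, the two measurements would be `average_ms(lut, 2000, 100, false)` and `average_ms(lut, 2000, 100, true)` for `lut = make_lut(3000, 2000)`.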

SergeyKostrov — Tue, 19 Feb 2013 05:13:45 GMT

Hi Pavel, I don't have a system with Xeon Ex-xxxx but I could try to investigate ( at the end of the week ) what could possibly be wrong. I have an Intel Core i7-3840QM ( Ivy Bridge / 4 cores ); let me know if you're interested. Could you provide the L1, L2 and L3 cache sizes for all CPUs? ( from ark.intel.com )

Bernard — Tue, 19 Feb 2013 06:10:06 GMT

I think that profiling your program with Xperf should be done first. The main idea is to check how much time is spent in the thread creation stage and in the context switch (cs) stage. Please install Xperf, or run it if you already have it installed. Next, start your application. Below are the commands to be entered from an elevated command prompt.

xperf.exe -on PROC_THREAD+LOADER+CSWITCH -stackwalk CSwitch+ThreadCreate

xperf.exe -d "name of your file".etl

Bernard — Tue, 19 Feb 2013 08:36:40 GMT

Hi Pavel

I forgot to add that you need to disable paging of the kernel executive on Win7 64-bit. Use this command:

REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

Pavel_Kogan — Tue, 19 Feb 2013 12:50:59 GMT

Hi Sergey,

Thanks for your offer. I hope to resolve the problem before the weekend, but who knows. On the above-mentioned site I found only the L3 cache size. The cache sizes are: Xeon E5645 - 12M (shared between 6 cores), Xeon E5-2620 - 15M (shared between 6 cores), Xeon E3-1230V2 - 8M (shared between 4 cores).

Patrick_F_Intel1 — Tue, 19 Feb 2013 13:10:31 GMT

Hello Pavel,

I don't have VS2012+ installed, so I don't have the <thread> file... so I can't build your example.

Have you tried adding timing statements just inside the Run() routine? It seems like this would tell you whether the work itself is running slower or whether the overhead of creating a thread is just much higher in the Sandy Bridge case than in the other cases.

Pat
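Pat's suggestion can be sketched roughly as follows, assuming a task class with a Run() method similar to the one in the sample. `TimedTask`, `Split` and `measure_in_thread` are hypothetical names, and the sort is only a placeholder for the real work. Run() records its own elapsed time, the caller times creation, execution and join together, and the difference between the two is the threading overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for the task class in the sample.
struct TimedTask {
    std::vector<int> data;
    double work_ms = 0.0;  // time spent inside Run() only

    void Run() {
        auto start = std::chrono::steady_clock::now();
        std::sort(data.begin(), data.end());  // placeholder for the real work
        auto stop = std::chrono::steady_clock::now();
        work_ms = std::chrono::duration<double, std::milli>(stop - start).count();
    }
};

struct Split {
    double total_ms;     // thread creation + Run() + join, seen by the caller
    double work_ms;      // Run() only, measured on the worker thread
    double overhead_ms;  // total_ms - work_ms
};

// Runs the task on a separate thread and splits the elapsed time.
Split measure_in_thread(TimedTask& task) {
    auto start = std::chrono::steady_clock::now();
    std::thread t(&TimedTask::Run, &task);  // pointer: runs on the caller's object
    t.join();
    auto stop = std::chrono::steady_clock::now();
    double total = std::chrono::duration<double, std::milli>(stop - start).count();
    assert(total >= task.work_ms);  // the outer interval encloses the inner one
    return {total, task.work_ms, total - task.work_ms};
}
```

If `overhead_ms` stays small on all three servers while `work_ms` grows on the Sandy Bridge machine, the work itself is slower there; if `overhead_ms` grows, thread creation or scheduling is the more likely cause.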

SergeyKostrov — Tue, 19 Feb 2013 14:01:00 GMT

>>... I found only L3 cache size. The cache sizes are:

All the remaining numbers should be in the datasheets ( PDFs / links are on the right-hand side of the web page for a given CPU on ark.intel.com ).

>>Xeon E5645 - 12M (shared between 6 cores) ,
>>Xeon E5-2620 - 15M (shared between 6 cores),
>>Xeon E3-1230V2 - 8M (shared between 4 cores)

It matches my system, and it will be interesting to see whether the 13% difference in performance is reproduced.

>>...LUT with 3000 int rows, each row contains about 2000 numbers...

Just to note, the size of your LUT ( 3000 * 2000 * sizeof(int) = 6000000 * 4 = 24000000 bytes ) is ~22.89MB and it exceeds the size of the L3 cache on every system you use. Also, the LUT is created in the primary thread, and in the 2nd case the additional thread could be scheduled on a different CPU ( needs to be investigated! ). In that scenario both threads, scheduled on different CPUs, are possibly competing for access to the L3 cache.

In terms of common multi-threading problems, two cases are possible:

- Race Conditions ( more likely / consider your input array as a "large shared variable"... )
- False Sharing ( less likely )

Could you try to use VTune to review what is going on with the L3 cache? Another option to consider is to pause the primary thread until processing in the 2nd thread is completed ( some synchronization object has to be used ).

SergeyKostrov — Tue, 19 Feb 2013 14:18:23 GMT

>>...Another option to consider is to pause the primary thread until processing in the 2nd thread is completed ( some
>>synchronization object has to be used ).

Or, with the Win32 API, something like:

...
::SuspendThread( hPrimaryThread );
...

Note: the 2nd thread should suspend the primary thread and then resume it as soon as the processing is completed. But I think it could be done in a different way with the API from the thread header.

Pavel_Kogan — Tue, 19 Feb 2013 14:50:36 GMT

Hi Sergey,

It is true that the whole data set is larger than the L3 cache; however, there is no race, as only one thread is running and the other is suspended (join). Besides, I am not saying my implementation is super optimized or takes cache sizes into account; I just need to understand why there is a difference between the servers.

Thanks, Pavel

Bernard — Wed, 20 Feb 2013 05:19:29 GMT

@Pavel

Besides running xperf, you can also profile your code with VTune, as Sergey suggested. If you need a precise percentage of the time spent in the thread creation and context switching procedures, I advise using xperf.

Bernard — Wed, 20 Feb 2013 05:24:08 GMT

>>>but I think it could be done in a different way with API from thread header>>>

This simply means adding another layer of indirection above the Win API. Wouldn't it be a better option to call the thread scheduling API directly from his code?

SergeyKostrov — Wed, 20 Feb 2013 20:39:34 GMT

>>...Will not be a better option to call directly thread scheduling API directly from his code?

No. The test is very simple and you could try to run ( or debug ) it in order to see how it works.

SergeyKostrov — Wed, 20 Feb 2013 20:43:39 GMT

Pavel, I have not reproduced your problem: on my computer, when the command line option '--fast' was used, the test ran faster. Here are the test results:

[ Tests - Debug ]
..>main.exe
Average run time: 546.466[ms]
..>main.exe --fast
Average run time: 392.835[ms]

[ Tests - Release ]
..>main.exe
Average run time: 426.612[ms]
..>main.exe --fast
Average run time: 391.799[ms]

SergeyKostrov — Wed, 20 Feb 2013 20:47:23 GMT

Here are details on how the executables were compiled:

Notes:
- Visual Studio 2012 environment & Intel C++ compiler XE 13.0.0.089 ( Initial Release )
- No modifications to your source code

[ Compilation - Debug ]
..>icl /MDd main.cpp
Intel(R) C++ Compiler XE for applications running on IA-32, Version 13.0.0.089 Build 20120731
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
main.cpp
Microsoft (R) Incremental Linker Version 11.00.50727.1
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
main.obj

[ Compilation - Release ]
..>icl /MD main.cpp
Intel(R) C++ Compiler XE for applications running on IA-32, Version 13.0.0.089 Build 20120731
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
main.cpp
Microsoft (R) Incremental Linker Version 11.00.50727.1
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
main.obj

Hardware & Software:
OS Name: Microsoft Windows 7 Professional
Version: 6.1.7601 Service Pack 1 Build 7601
System Model: Dell Precision M4700
System Type: x64-based PC
Processor: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)

Pavel_Kogan — Wed, 20 Feb 2013 20:58:17 GMT

Thanks all, I will be back to work on this problem in a day or two and will update you with the results.

Bernard — Thu, 21 Feb 2013 05:12:00 GMT

>>>No. The test is very simple and you could try to run ( or debug ) it in order to see how it works>>>

OK. I will test on my PC.

SergeyKostrov — Thu, 21 Feb 2013 13:56:16 GMT

>>>>No. The test is very simple and you could try to run ( or debug ) it in order to see how it works...
>>
>>Ok.I will test on my pc.

That would be nice. You will need a C/C++ compiler that has the thread header file. So far I have seen it only in Visual Studio 2012. Please take into account that the Express Edition ( available for free ) could be used ( this is what I have ) and you could compile the test with the default Microsoft C++ compiler ( you don't need the Intel C++ compiler ). Let me know if you need a Visual Studio 2012 project for your tests. Thanks in advance.

Pavel_Kogan — Thu, 21 Feb 2013 15:28:41 GMT

Hi all,

I noticed that changing thread t(&CorticaTask::Run, task) to thread t(&CorticaTask::Run, &task) makes things run significantly faster (on Sandy Bridge), which is understandable; however, it is still very strange that the code runs slower at some working point on a better and newer server.

Regards, Pavel
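One plausible reading of this speedup: std::thread decay-copies its arguments, so passing `task` by value copies the whole object (including any data it owns) every time a thread is created, and Run() then executes on the copy. Passing `&task` copies only a pointer. The definition of CorticaTask is not shown in the thread, so the sketch below uses a hypothetical `CountingTask` that owns a vector and counts its copy constructions; whether CorticaTask actually owns the LUT is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical task that owns some data and counts how often it is copied.
struct CountingTask {
    static std::atomic<int> copies;
    std::vector<int> lut;
    bool ran = false;

    explicit CountingTask(std::size_t n) : lut(n, 1) {}
    CountingTask(const CountingTask& other) : lut(other.lut), ran(other.ran) { ++copies; }
    CountingTask(CountingTask&&) = default;  // moves are cheap and not counted

    void Run() { ran = true; }
};
std::atomic<int> CountingTask::copies{0};

// By value: the thread gets its own copy, so the caller's object never runs.
bool run_by_value(CountingTask& task) {
    std::thread t(&CountingTask::Run, task);
    t.join();
    return task.ran;
}

// By pointer: no copy is made, Run() executes on the caller's object.
bool run_by_pointer(CountingTask& task) {
    std::thread t(&CountingTask::Run, &task);
    t.join();
    return task.ran;
}
```

After `run_by_value`, `CountingTask::copies` is at least 1 and the caller's `ran` flag is still false; after `run_by_pointer`, no additional copy is counted and the flag is set. For a task holding a ~24MB LUT, that per-thread copy alone could account for a large part of the separate-thread time.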

Patrick_F_Intel1 — Thu, 21 Feb 2013 15:51:49 GMT

Hello Pavel,

Have you tried adding timing statements inside the Run() routine? This would tell us how much of the runtime variation is due to thread creation overhead versus how much time is spent actually doing the work in the loop.

Pat

SergeyKostrov — Thu, 21 Feb 2013 19:51:11 GMT

>>...it still very strange that it is running slower in some working point on better and newer server...

Pavel, my Dell Precision M4700 with Windows 7 Professional 64-bit is highly optimized for performance evaluations. That is, I turned off as many Windows Services as possible, and when the computer is not connected to the network ( I simply disable the network card ) only 33 Windows Services are running. It makes sense for you to check how many Windows Services are running on your computers. By default, right after a Windows installation is completed, at least 50-60 different Windows Services are running, and that number could be even greater. Please also check the settings of your anti-virus software. If you need a detailed list of my software configuration(s), I could provide it.

>>...how much of the runtime variation is due to thread creation overhead

Patrick, Windows creates threads very fast. I don't have an exact number, but it should take a couple of hundred microseconds or less. Pavel's differences in performance are too big for that. However, such a verification with the RDTSC instruction would be useful. My overall conclusion is that something else is wrong and some software or hardware is affecting performance.

Note: Pavel, did you install all updates for Visual Studio 2012? I did it last weekend...
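The thread creation cost discussed here can be checked in isolation with a small sketch like the one below. It uses std::chrono::steady_clock rather than the RDTSC instruction mentioned above, purely to keep the example portable, and `average_create_join_us` is a hypothetical helper name. It measures creation and join of an empty thread together, so the result is an upper bound on creation cost alone.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Average cost, in microseconds, of creating an empty thread and joining it.
double average_create_join_us(int iterations) {
    assert(iterations > 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::thread t([] {});  // the thread body does no work
        t.join();
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / iterations;
}
```

Comparing `average_create_join_us(1000)` across the three servers would show whether thread creation itself differs between them. A gap of roughly 200 ms per iteration, as in the Sandy Bridge results, would be far larger than the few hundred microseconds estimated above.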