Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Threads overhead Nehalem vs Sandy-bridge vs Ivy-bridge

Pavel_Kogan
Beginner
4,434 Views

Hi all,

After upgrading our servers from dual Xeon E5645 2.4GHz (Nehalem) to dual Xeon E5-2620 2.0GHz (Sandy Bridge), I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that summarizes the problem. In general, I have a prebuilt LUT with 3000 int rows, each row containing about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I ran it once in the main thread and once in a separate thread (with the main thread waiting). I do know that there is thread creation overhead, but I used to think it is up to 1 ms. For precise results I am averaging 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64, and my application is also x64. The code was compiled with VC++ 2012 Express. The results are:

Dual Xeon E5645 2.4GHz (Nehalem): Main thread - 340.522[ms], Separate thread: 388.598[ms]  Diff: 13%

Dual Xeon E5-2620 2.0GHz (Sandy bridge): Main thread - 362.515[ms], Separate thread: 565.295[ms]  Diff: 36%

Single Xeon E3-1230 V2 3.3GHz (Ivy bridge): Main thread - 234.928[ms], Separate thread: 267.603[ms]  Diff: 13%

My problem is with the 36%. Can anyone explain to me what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy Bridge?

Many thanks, Pavel.
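
The Diff figures in the three result lines depend on which time is taken as the baseline, and the post does not say which one was used. The following is a minimal sketch that recomputes the overhead both ways from the posted timings; the `Result` struct and the function names are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdio>

// Times are in milliseconds, copied from the three result lines in the post.
struct Result {
    const char* cpu;
    double main_ms;      // work done on the main thread
    double threaded_ms;  // same work done on a separate thread
};

// Extra time of the threaded run as a percentage of the main-thread time.
double overhead_vs_main(const Result& r) {
    assert(r.main_ms > 0.0);
    return 100.0 * (r.threaded_ms - r.main_ms) / r.main_ms;
}

// Extra time of the threaded run as a percentage of the threaded time.
double overhead_vs_threaded(const Result& r) {
    assert(r.threaded_ms > 0.0);
    return 100.0 * (r.threaded_ms - r.main_ms) / r.threaded_ms;
}

// Prints both percentages for each server.
void print_overheads() {
    const Result results[] = {
        {"Xeon E5645",      340.522, 388.598},
        {"Xeon E5-2620",    362.515, 565.295},
        {"Xeon E3-1230 V2", 234.928, 267.603},
    };
    for (const Result& r : results) {
        std::printf("%-16s vs main: %.1f%%  vs threaded: %.1f%%\n",
                    r.cpu, overhead_vs_main(r), overhead_vs_threaded(r));
    }
}
```

Calling `print_overheads()` from `main` prints one line per server. For the Sandy Bridge server the overhead is roughly 36% of the threaded time, which matches the posted figure, but roughly 56% when measured against the main-thread time.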

55 Replies
SergeyKostrov
Valued Contributor II
2,177 Views
Hi Pavel, I don't have a system with Xeon Ex-xxxx but I could try to investigate ( at the end of the week ) what could be possibly wrong. I have Intel Core i7-3840QM ( Ivy Bridge / 4 cores ) and let me know if you're interested. Could you provide L1, L2 and L3 cache line sizes for all CPUs? ( from ark.intel.com )
Bernard
Valued Contributor I
2,177 Views

I think that profiling your program with Xperf should be done first. The main idea is to check how much time is spent in the thread creation stage and in the context switch (cs) stage. Please install Xperf, or run it if you already have it installed, and then start your application. Below are the commands to be entered from an elevated command prompt.

xperf.exe -on PROC_THREAD+LOADER+CSWITCH -stackwalk CSwitch+ThreadCreate

xperf.exe -d "name of your file".etl

Bernard
Valued Contributor I
2,177 Views

Hi Pavel

I forgot to add that on Win7 64-bit you need to disable paging of the kernel executive so that stack walking works. Use this command (a reboot is needed for it to take effect):

REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

Pavel_Kogan
Beginner
2,177 Views

Hi Sergey,

Thanks for your offer. I hope to resolve the problem before the weekend, but who knows. On the above-mentioned site I found only the L3 cache sizes. They are: Xeon E5645 - 12M (shared between 6 cores), Xeon E5-2620 - 15M (shared between 6 cores), Xeon E3-1230V2 - 8M (shared between 4 cores).

Patrick_F_Intel1
Employee
2,177 Views

Hello Pavel,

I don't have VS2012+ installed, so I don't have the <thread> file... so I can't build your example.

Have you tried adding timing statements just inside the Run() routine? It seems like this would tell you whether the work itself is running slower or whether the overhead of creating a thread is just much higher in the Sandy Bridge case than in the other cases.

Pat
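
A minimal sketch of what timing inside Run() could look like, assuming a task shaped like the one described in the first post (copy each LUT row to a preallocated buffer, then sort it). The attached sample is not reproduced in this thread, so `TimedTask`, `Timing`, and `run_in_thread` are hypothetical names, and the LUT contents are generated data rather than the real table.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::duration<double, std::milli>;

// Hypothetical stand-in for the attached sample: copy each LUT row into a
// preallocated buffer and sort it, recording how long the work itself took.
struct TimedTask {
    std::vector<std::vector<int>> lut;
    std::vector<int> buffer;
    double inside_ms = 0.0;  // time measured inside Run()

    TimedTask(std::size_t rows, std::size_t cols)
        : lut(rows, std::vector<int>(cols)), buffer(cols) {
        unsigned x = 12345u;  // simple LCG so the rows are not already sorted
        for (auto& row : lut)
            for (auto& v : row) {
                x = x * 1664525u + 1013904223u;
                v = static_cast<int>(x >> 8);
            }
    }

    void Run() {
        const auto t0 = Clock::now();
        for (const auto& row : lut) {
            assert(row.size() == buffer.size());
            std::copy(row.begin(), row.end(), buffer.begin());
            std::sort(buffer.begin(), buffer.end());
        }
        inside_ms = Ms(Clock::now() - t0).count();
    }
};

struct Timing {
    double inside_ms;   // work only, as seen inside Run()
    double outside_ms;  // work plus thread creation and join
};

// Runs the task on a new thread and reports both measurements.
Timing run_in_thread(TimedTask& task) {
    const auto t0 = Clock::now();
    std::thread t(&TimedTask::Run, &task);  // pointer: the task is not copied
    t.join();
    return {task.inside_ms, Ms(Clock::now() - t0).count()};
}
```

With `TimedTask task(3000, 2000);` and `Timing tm = run_in_thread(task);`, the gap between `tm.outside_ms` and `tm.inside_ms` approximates the thread creation and join cost. If `tm.inside_ms` itself is higher than the time of a direct `task.Run()` call on the main thread, the work is what got slower.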

SergeyKostrov
Valued Contributor II
2,177 Views
>>... I found only L3 cache size. The cache sizes are:

All the remaining numbers should be in the Datasheets ( PDFs / links are on the right part of the web page for a given CPU on ark.intel.com ).

>>Xeon E5645 - 12M (shared between 6 cores) ,
>>Xeon E5-2620 - 15M (shared between 6 cores),
>>Xeon E3-1230V2 - 8M (shared between 4 cores)

It matches my system and it will be interesting to see whether the 13% difference in performance is reproduced.

>>...LUT with 3000 int rows, each row contains about 2000 numbers...

Simply to note, the size of your LUT ( 3000 * 2000 * sizeof(int) = 6000000 * 4 = 24000000 bytes ) is ~22.89MB and it exceeds the size of the L3 cache on every system you use. Then, the LUT is created in the primary thread and in the 2nd case the additional thread could be scheduled on a different CPU ( needs to be investigated! ). In that scenario both threads, scheduled on different CPUs, are possibly competing for access to the L3 cache.

In terms of common problems of multi-threading two cases are possible:
- Race Conditions ( more likely / consider your input array is a "large shared variable"... )
- False Sharing ( less likely )

Could you try to use VTune to review what is going on with L3 cache lines? Another option to consider is to pause the primary thread until processing in the 2nd thread is completed ( some synchronization object has to be used ).
SergeyKostrov
Valued Contributor II
2,177 Views
>>...Another option to consider is to pause the primary thread until processing in the 2nd thread is completed ( some
>>synchronization object has to be used ).

Or, with Win32 API something like:

...
::SuspendThread( hPrimaryThread );
...

Note: the 2nd thread should suspend the primary thread and then resume it as soon as the processing is completed. But I think it could be done in a different way with the API from the thread header.
Pavel_Kogan
Beginner
2,177 Views

Hi Sergey,

It is true that the whole data set is larger than the L3 cache; however, there is no race, as only one thread is running and the other is suspended (join). Besides, I am not saying my implementation is super optimized or that it takes cache sizes into account. I just need to understand why the results differ between servers.

Thanks, Pavel

Bernard
Valued Contributor I
2,177 Views

@Pavel

Besides running xperf, you can also profile your code with VTune, as Sergey suggested. If you need a precise percentage of the time spent in the thread creation and context switching procedures, it is advised to use xperf.

Bernard
Valued Contributor I
2,177 Views

>>>but I think it could be done in a different way with API from thread header>>>

This simply means adding another layer of indirection above the Win API. Wouldn't it be a better option to call the thread scheduling API directly from his code?

SergeyKostrov
Valued Contributor II
2,177 Views
>>...Will not be a better option to call directly thread scheduling API directly from his code?

No. The test is very simple and you could try to run ( or debug ) it in order to see how it works.
SergeyKostrov
Valued Contributor II
2,177 Views
Pavel, I have Not reproduced your problem; on my computer, when the command line option '--fast' was used, it ran faster. Here are the test results:

[ Tests - Debug ]
..>main.exe
Average run time: 546.466[ms]
..>main.exe --fast
Average run time: 392.835[ms]

[ Tests - Release ]
..>main.exe
Average run time: 426.612[ms]
..>main.exe --fast
Average run time: 391.799[ms]
SergeyKostrov
Valued Contributor II
2,177 Views
Here are details on how the executables were compiled:

Notes:
- Visual Studio 2012 environment & Intel C++ compiler XE 13.0.0.089 ( Initial Release )
- No modifications in your source codes

[ Compilation - Debug ]
..>icl /MDd main.cpp
Intel(R) C++ Compiler XE for applications running on IA-32, Version 13.0.0.089 Build 20120731
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
main.cpp
Microsoft (R) Incremental Linker Version 11.00.50727.1
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
main.obj

[ Compilation - Release ]
..>icl /MD main.cpp
Intel(R) C++ Compiler XE for applications running on IA-32, Version 13.0.0.089 Build 20120731
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
main.cpp
Microsoft (R) Incremental Linker Version 11.00.50727.1
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
main.obj

Hardware & Software:
OS Name: Microsoft Windows 7 Professional
Version: 6.1.7601 Service Pack 1 Build 7601
System Model: Dell Precision M4700
System Type: x64-based PC
Processor: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)
Pavel_Kogan
Beginner
2,177 Views

Thanks all, I will be back to work on this problem in a day or two and will update you with the results.

Bernard
Valued Contributor I
2,177 Views

>>>No. The test is very simple and you could try to run ( or debug ) it in order to see how it works>>>

Ok.I will test on my pc.

SergeyKostrov
Valued Contributor II
2,177 Views
>>>>No. The test is very simple and you could try to run ( or debug ) it in order to see how it works...
>>
>>Ok.I will test on my pc.

That would be nice. You will need a C/C++ compiler that has the thread header file. So far I have seen it only in Visual Studio 2012. Please take into account that the Express Edition ( available for free ) could be used ( this is what I have ) and you could compile the test with the default Microsoft C++ compiler ( you don't need the Intel C++ compiler ). Let me know if you need a Visual Studio 2012 project for your tests. Thanks in advance.
Pavel_Kogan
Beginner
2,177 Views

Hi all, 

I noticed that changing thread t(&CorticaTask::Run, task) to thread t(&CorticaTask::Run, &task) makes things run significantly faster (on Sandy Bridge), which is understandable; however, it is still very strange that it runs slower at some working point on a better and newer server.

Regards, Pavel
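
The difference between the two calls is that std::thread stores copies of its arguments: passing `task` makes the new thread run on a copy of the object, while passing `&task` copies only a pointer. The sketch below illustrates the mechanism with a hypothetical `CountingTask` type. The real CorticaTask is not shown in this thread, so whether it holds the LUT as a member (which would make the copy expensive) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical task type that counts how often it is copy-constructed.
struct CountingTask {
    static int copies;         // number of copy constructions so far
    std::vector<int> payload;  // stands in for a large member such as a LUT
    bool ran = false;          // set by Run() on the object it was called on

    explicit CountingTask(std::size_t n) : payload(n, 1) {}
    CountingTask(const CountingTask& other)
        : payload(other.payload), ran(other.ran) { ++copies; }
    CountingTask(CountingTask&&) = default;

    void Run() {
        assert(!payload.empty());
        ran = true;
    }
};
int CountingTask::copies = 0;

// Passing the object itself: the thread stores its own copy and runs on it.
int copies_when_passed_by_value(CountingTask& task) {
    const int before = CountingTask::copies;
    std::thread t(&CountingTask::Run, task);
    t.join();
    return CountingTask::copies - before;
}

// Passing a pointer: the thread works on the caller's object, no copy.
int copies_when_passed_by_pointer(CountingTask& task) {
    const int before = CountingTask::copies;
    std::thread t(&CountingTask::Run, &task);
    t.join();
    return CountingTask::copies - before;
}
```

After `copies_when_passed_by_value(task)` the caller's `task.ran` is still false, because Run() executed on the thread's private copy. The exact number of copies depends on the standard library implementation, but it is at least one, and each copy duplicates the whole payload.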

Patrick_F_Intel1
Employee
2,177 Views

Hello Pavel,

Have you tried adding timing statements inside the Run() routine? This would tell us how much of the runtime variation is due to thread creation overhead versus how much time is spent actually doing the work in the loop.

Pat

SergeyKostrov
Valued Contributor II
2,177 Views
>>...it still very strange that it is running slower in some working point on better and newer server...

Pavel, my Dell Precision M4700 with Windows 7 Professional 64-bit OS is highly optimized for different performance evaluations. It means that I turned off as many Windows Services as possible, and when the computer is Not connected to the network ( I simply disable the network card ) only 33 Windows Services are running. It makes sense for you to check how many Windows Services are running on your computers. By default, right after Windows installation is completed, at least 50-60 different Windows Services are running and that number could be even greater. Please also check the settings of your Anti-Virus software. If you need a detailed list of my software configuration(s) I could provide it.

>>...how much of the runtime variation is due to thread creation overhead

Patrick, Windows creates threads very fast. I don't have an exact number but it has to be done in a couple of hundred microseconds, or less. Pavel's differences in performance are too big. However, such a verification with the RDTSC instruction will be useful. My overall conclusion is that something else is wrong and some software or hardware affects performance.

Note: Pavel, did you install all updates for Visual Studio 2012? I did it last weekend...
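
One way to put a number on the thread creation cost is to time the creation and join of threads that do nothing. The sketch below uses std::chrono::steady_clock instead of the RDTSC instruction mentioned above, purely to keep it portable; the function name is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Average cost, in microseconds, of creating a thread that does nothing and
// joining it. The figure includes OS scheduling effects, not only creation.
double average_thread_create_join_us(int iterations) {
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;
    const auto t0 = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::thread t([] {});
        t.join();
    }
    const std::chrono::duration<double, std::micro> total = Clock::now() - t0;
    return total.count() / iterations;
}
```

Calling `average_thread_create_join_us(1000)` on each server gives a per-thread figure that can be compared with the gaps of tens to hundreds of milliseconds reported in the first post.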
Bernard
Valued Contributor I
2,054 Views

>>>My Dell Precision M4700 with Windows 7 Professional 64-bit OS is highly optimized for different performance evaluations. It means, that I turned off as many as possible Windows Services and when the computer is Not connected to the network ( I simply disable a network card ) only 33 Windows Services are working>>>

Disabling the network adapter is a wise decision, because servicing the interrupts raised by the network card and the subsequent packet processing can bog down the CPU. I would also recommend running general system monitoring with the Xperf tool from time to time; you will get a very detailed breakdown of the various activity. Moreover, it is recommended to disable your AV software (when you are not connected to the Internet). It is known that, for example, Kaspersky AV uses system-wide hooks and detours to check the callers of system functions, and this activity can add to the load on the CPU. AV also often installs custom drivers used to gain access to various internal OS structures implemented in the kernel, and this activity is sometimes done at IRQL == DISPATCH_LEVEL (DPC level), mostly for synchronization, and can block the scheduler, which also runs at DISPATCH_LEVEL. So uninstalling AV on a developer's machine is highly recommended.
