<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic &amp;gt;&amp;gt;&amp;gt;My Dell Precision M4700 in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949527#M2125</link>
    <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;My Dell Precision M4700 with Windows 7 Professional 64-bit OS is &lt;STRONG&gt;highly optimized&lt;/STRONG&gt; for different performance evaluations. It means, that I turned off as many as possible Windows Services and when the computer is Not connected to the network ( I simply disable a network card ) only 33 Windows Services are working&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Disabling the network adapter is a wise decision, because servicing the interrupts raised by the network card and the packet processing that follows can hog the CPU. I would also recommend running general system monitoring from time to time with the Xperf tool; it gives a very detailed breakdown of system activity. It is also recommended to disable your AV software when you are not connected to the Internet. Kaspersky AV, for example, is known to use system-wide hooks and detours to check the callers of system functions, and this activity adds to the CPU load. AV software also often installs custom drivers to gain access to various internal OS structures implemented in the kernel. Some of that work is done at IRQL == DISPATCH_LEVEL, mostly for synchronization, and can block the scheduler, which also runs at DISPATCH_LEVEL, so uninstalling AV on a developer's machine is highly recommended.&lt;/P&gt;</description>
    <pubDate>Fri, 22 Feb 2013 05:21:00 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2013-02-22T05:21:00Z</dc:date>
    <item>
      <title>Threads overhead Nehalem vs Sandy-bridge vs Ivy-bridge</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949507#M2105</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;
&lt;P&gt;After upgrading our servers from&amp;nbsp;&lt;STRONG&gt;Dual Xeon E5645 2.4GHz (Nehalem) &lt;/STRONG&gt;to&lt;STRONG&gt;&amp;nbsp;Dual Xeon E5-2620 2.0GHz (Sandy bridge)&amp;nbsp;&lt;/STRONG&gt;I see a serious performance decrease in my multithreaded application. I have created a small C++ sample&amp;nbsp;(attached) that summarizes the problem. In short, I have a prebuilt LUT with 3000 int rows, each row containing about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I ran it once in the main thread and once in a separate thread (while the main thread waits). I do know that there is thread creation overhead, but I used to think it is at most 1 ms. For precise results I average over 100 iterations.&amp;nbsp;I tested the same code on 3 servers running Windows Server 2008 R2 x64, and my application is also x64. The code was compiled with VC++ 2012 Express. The results are:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Dual Xeon E5645 2.4GHz (Nehalem):&amp;nbsp;&lt;/STRONG&gt;Main thread -&amp;nbsp;340.522[ms], Separate thread:&amp;nbsp;388.598[ms] &amp;nbsp;&lt;STRONG&gt;Diff: 13%&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Dual Xeon E5-2620 2.0GHz (Sandy bridge):&amp;nbsp;&lt;/STRONG&gt;Main thread -&amp;nbsp;362.515[ms], Separate thread:&amp;nbsp;565.295[ms]&lt;STRONG&gt;&amp;nbsp; Diff: 36%&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Single Xeon E3-1230 V2 3.3GHz (Ivy bridge):&amp;nbsp;&lt;/STRONG&gt;Main thread - 234.928[ms], Separate thread: 267.603[ms] &amp;nbsp;&lt;STRONG&gt;Diff: 13%&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;My problem is with the 36%. Can anyone explain to me what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy bridge?&lt;/P&gt;
&lt;P&gt;Many thanks, Pavel.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Feb 2013 23:44:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949507#M2105</guid>
      <dc:creator>Pavel_Kogan</dc:creator>
      <dc:date>2013-02-18T23:44:33Z</dc:date>
    </item>
    <item>
      <title>Hi Pavel,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949508#M2106</link>
      <description>Hi Pavel,

I don't have a system with &lt;STRONG&gt;Xeon Ex-xxxx&lt;/STRONG&gt;, but I could try to investigate ( at the end of the week ) what could possibly be wrong. I have an &lt;STRONG&gt;Intel Core i7-3840QM&lt;/STRONG&gt; ( Ivy Bridge / 4 cores ); let me know if you're interested.

Could you provide the L1, L2 and L3 cache sizes for all CPUs? ( from ark.intel.com )</description>
      <pubDate>Tue, 19 Feb 2013 05:13:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949508#M2106</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-19T05:13:45Z</dc:date>
    </item>
    <item>
      <title>I think that profiling your</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949509#M2107</link>
      <description>&lt;P&gt;I think that profiling your program with Xperf should be done first. The main idea is to check how much time is spent in the thread creation stage and in the context switch (cs)&amp;nbsp;stage. Please install Xperf, or run it if you already have it installed. Start tracing with the first command below, run your application, and then stop tracing with the second command. Both commands must be entered from an elevated command prompt.&lt;/P&gt;
&lt;P&gt;xperf.exe -on PROC_THREAD+LOADER+CSWITCH -stackwalk CSwitch&lt;/P&gt;
&lt;P&gt;xperf.exe -d "name of your file".etl&lt;/P&gt;</description>
      <pubDate>Tue, 19 Feb 2013 06:10:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949509#M2107</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-19T06:10:06Z</dc:date>
    </item>
    <item>
      <title>Hi Pavel</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949510#M2108</link>
      <description>&lt;P&gt;Hi Pavel&lt;/P&gt;
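&lt;P&gt;I forgot to add that on Win7 64-bit you need to disable paging of the executive ( kernel-mode code and drivers ), otherwise stack walking will not work properly. Use this command and then reboot:&lt;/P&gt;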
&lt;P&gt;REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f&lt;/P&gt;</description>
      <pubDate>Tue, 19 Feb 2013 08:36:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949510#M2108</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-19T08:36:40Z</dc:date>
    </item>
    <item>
      <title>Hi Sergey,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949511#M2109</link>
      <description>&lt;P&gt;Hi Sergey,&lt;/P&gt;
&lt;P&gt;Thanks for your offer. I hope to resolve the problem before the weekend, but who knows. On the above-mentioned site I found only the L3 cache size. The cache sizes are:&amp;nbsp;&lt;STRONG&gt;Xeon E5645 - 12M (shared between 6 cores), Xeon E5-2620 - 15M (shared between 6 cores), Xeon E3-1230V2 - 8M (shared between 4 cores).&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Feb 2013 12:50:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949511#M2109</guid>
      <dc:creator>Pavel_Kogan</dc:creator>
      <dc:date>2013-02-19T12:50:59Z</dc:date>
    </item>
    <item>
      <title>Hello Pavel,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949512#M2110</link>
      <description>&lt;P&gt;Hello Pavel,&lt;/P&gt;
&lt;P&gt;I don't have VS2012+ installed, so I don't have the &amp;lt;thread&amp;gt; header and I can't build your example.&lt;/P&gt;
&lt;P&gt;Have you tried adding timing statements just inside the Run() routine? That would tell you whether the work itself is running slower or whether the overhead of creating a thread is just much higher in the Sandy Bridge case than in the other cases.&lt;/P&gt;
&lt;P&gt;Pat&lt;/P&gt;</description>
      <pubDate>Tue, 19 Feb 2013 13:10:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949512#M2110</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2013-02-19T13:10:31Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;... I found only L3 cache</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949513#M2111</link>
      <description>&amp;gt;&amp;gt;... I found only L3 cache size. The cache sizes are:

The remaining numbers should be in the datasheets ( PDFs / links are on the right side of the web page for a given CPU on ark.intel.com ).

&amp;gt;&amp;gt;Xeon E5645 - &lt;STRONG&gt;12M&lt;/STRONG&gt; (shared between 6 cores) ,
&amp;gt;&amp;gt;Xeon E5-2620 - &lt;STRONG&gt;15M&lt;/STRONG&gt; (shared between 6 cores),
&amp;gt;&amp;gt;&lt;STRONG&gt;Xeon E3-1230V2 - 8M (shared between 4 cores)&lt;/STRONG&gt;

That matches my system, so it will be interesting to see whether the 13% difference in performance is reproduced.

&amp;gt;&amp;gt;...LUT with &lt;STRONG&gt;3000 int rows&lt;/STRONG&gt;, each row contains about &lt;STRONG&gt;2000 numbers&lt;/STRONG&gt;...

Simply to note, the size of your LUT ( 3000 * 2000 * sizeof(int) = 6000000 * 4 = 24000000 bytes ) is ~&lt;STRONG&gt;22.89MB&lt;/STRONG&gt; and it exceeds the size of the L3 cache on every system you use.
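
As a quick check of that arithmetic, here is a minimal C++ sketch. It assumes sizeof(int) is 4 bytes ( as with Visual C++ on x64 ) and treats the L3 sizes quoted above as MiB; the helper names are made up for illustration and are not from Pavel's sample.

<![CDATA[
```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by a table of `rows` x `cols` elements of `elem_size` bytes each.
constexpr std::size_t lut_bytes(std::size_t rows, std::size_t cols, std::size_t elem_size) {
    return rows * cols * elem_size;
}

// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::size_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// True when `bytes` cannot fit into an L3 cache of `l3_mib` mebibytes.
// The cache size is an input supplied by the caller, not something queried from the CPU.
bool exceeds_l3(std::size_t bytes, std::size_t l3_mib) {
    assert(l3_mib > 0);
    return bytes > l3_mib * 1024 * 1024;
}
```
]]>

For example, lut_bytes(3000, 2000, 4) gives 24000000 bytes, to_mib of that value is about 22.89, and exceeds_l3 returns true for the 8, 12 and 15 MiB caches. A single row ( 2000 ints ) is only 8000 bytes.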

Also, the LUT is created in the primary thread, and in the 2nd case the additional thread could be scheduled on a different CPU ( needs to be investigated! ). In that scenario both threads, scheduled on different CPUs, are possibly competing for access to the L3 cache. In terms of common multi-threading problems, two cases are possible:

- Race Conditions ( more likely / consider your input array is a "large shared variable"... )
- False Sharing ( less likely )

Could you try to use VTune to review what is going on with L3 cache lines? Another option to consider is to &lt;STRONG&gt;pause&lt;/STRONG&gt; the primary thread until processing in the 2nd thread is completed ( some synchronization object has to be used ).</description>
      <pubDate>Tue, 19 Feb 2013 14:01:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949513#M2111</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-19T14:01:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Another option to</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949514#M2112</link>
      <description>&amp;gt;&amp;gt;...Another option to consider is to pause the primary thread until processing in the 2nd thread is completed ( some
&amp;gt;&amp;gt;synchronization object has to be used ).

Or, with Win32 API something like:
...
::&lt;STRONG&gt;SuspendThread&lt;/STRONG&gt;( hPrimaryThread );
...
Note: 2nd thread should suspend the primary thread and then resume it as soon as the processing is completed.

but I think it could be done in a different way with API from &lt;STRONG&gt;thread&lt;/STRONG&gt; header.</description>
      <pubDate>Tue, 19 Feb 2013 14:18:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949514#M2112</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-19T14:18:23Z</dc:date>
    </item>
    <item>
      <title>Hi Sergey,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949515#M2113</link>
      <description>&lt;P&gt;Hi Sergey,&lt;/P&gt;
&lt;P&gt;It is true that the whole data set is larger than the L3 cache; however, there is no race, as only one thread is running and the other is suspended (join). Besides, I am not saying my implementation is super optimized or takes cache sizes into account. I just need to understand why the results differ between servers.&lt;/P&gt;
&lt;P&gt;Thanks, Pavel&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Feb 2013 14:50:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949515#M2113</guid>
      <dc:creator>Pavel_Kogan</dc:creator>
      <dc:date>2013-02-19T14:50:36Z</dc:date>
    </item>
    <item>
      <title>@Pavel</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949516#M2114</link>
      <description>&lt;P&gt;@Pavel&lt;/P&gt;
&lt;P&gt;Besides running xperf, you can also profile your code with VTune, as Sergey suggested. If you need a precise percentage of the time spent in thread creation and context switching procedures, I advise using xperf.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Feb 2013 05:19:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949516#M2114</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-20T05:19:29Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;but I think it could be</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949517#M2115</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;but I think it could be done in a different way with API from &lt;STRONG&gt;thread&lt;/STRONG&gt; header&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;This simply means adding another layer of indirection above the Win API. Would it not be better to call the thread scheduling API directly from his code?&lt;/P&gt;</description>
      <pubDate>Wed, 20 Feb 2013 05:24:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949517#M2115</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-20T05:24:08Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Will not be a better</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949518#M2116</link>
      <description>&amp;gt;&amp;gt;...Will not be a better option to call directly thread scheduling API directly from his code?

No. The test is very simple and you could try to run ( or debug ) it in order to see how it works.</description>
      <pubDate>Wed, 20 Feb 2013 20:39:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949518#M2116</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-20T20:39:34Z</dc:date>
    </item>
    <item>
      <title>Pavel,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949519#M2117</link>
      <description>Pavel,

I could not reproduce your problem: on my computer the test ran faster when the command line option '--fast' was used. Here are the test results:

&lt;STRONG&gt;[ Tests - Debug ]&lt;/STRONG&gt;

..&amp;gt;main.exe
Average run time: 546.466[ms]

..&amp;gt;main.exe --fast
Average run time: 392.835[ms]

&lt;STRONG&gt;[ Tests - Release ]&lt;/STRONG&gt;

..&amp;gt;main.exe
Average run time: 426.612[ms]

..&amp;gt;main.exe --fast
Average run time: 391.799[ms]</description>
      <pubDate>Wed, 20 Feb 2013 20:43:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949519#M2117</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-20T20:43:39Z</dc:date>
    </item>
    <item>
      <title>Here are details on how</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949520#M2118</link>
      <description>Here are details on how executables were compiled:

&lt;STRONG&gt;Notes:&lt;/STRONG&gt;

- Visual Studio 2012 environment &amp;amp; Intel C++ compiler XE 13.0.0.089 ( Initial Release )
- No modifications to your source code

&lt;STRONG&gt;[ Compilation - Debug ]&lt;/STRONG&gt;

..&amp;gt;icl /MDd main.cpp
Intel(R) C++ Compiler XE for applications running on IA-32, Version 13.0.0.089 Build 20120731
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.

main.cpp
Microsoft (R) Incremental Linker Version 11.00.50727.1
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
main.obj

&lt;STRONG&gt;[ Compilation - Release ]&lt;/STRONG&gt;

..&amp;gt;icl /MD main.cpp
Intel(R) C++ Compiler XE for applications running on IA-32, Version 13.0.0.089 Build 20120731
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.

main.cpp
Microsoft (R) Incremental Linker Version 11.00.50727.1
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
main.obj

&lt;STRONG&gt;Hardware &amp;amp; Software:&lt;/STRONG&gt;
OS Name		Microsoft Windows 7 Professional
Version		6.1.7601 Service Pack 1 Build 7601
System Model	Dell Precision M4700
System Type		x64-based PC
Processor		Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)</description>
      <pubDate>Wed, 20 Feb 2013 20:47:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949520#M2118</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-20T20:47:23Z</dc:date>
    </item>
    <item>
      <title>Thanks all, I will be back to</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949521#M2119</link>
      <description>&lt;P&gt;Thanks all, I will be back to work on this problem in a day or two and will update you with the results.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Feb 2013 20:58:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949521#M2119</guid>
      <dc:creator>Pavel_Kogan</dc:creator>
      <dc:date>2013-02-20T20:58:17Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;No. The test is very</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949522#M2120</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;No. The test is very simple and you could try to run ( or debug ) it in order to see how it works&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Ok. I will test on my PC.&lt;/P&gt;</description>
      <pubDate>Thu, 21 Feb 2013 05:12:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949522#M2120</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-21T05:12:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;&gt;No. The test is very</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949523#M2121</link>
      <description>&amp;gt;&amp;gt;&amp;gt;&amp;gt;No. The test is very simple and you could try to run ( or debug ) it in order to see how it works...
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;Ok.I will test on my pc.

That would be nice.

You will need a C++ compiler that ships the &lt;STRONG&gt;thread&lt;/STRONG&gt; header file. So far I have seen it only in Visual Studio 2012. Please take into account that the Express Edition ( available for free ) can be used ( this is what I have ) and you can compile the test with the default Microsoft C++ compiler ( you don't need the Intel C++ compiler ). Let me know if you need a Visual Studio 2012 project for your tests.

Thanks in advance.</description>
      <pubDate>Thu, 21 Feb 2013 13:56:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949523#M2121</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-21T13:56:16Z</dc:date>
    </item>
    <item>
      <title>Hi all, </title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949524#M2122</link>
      <description>&lt;P&gt;Hi all,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I noticed that changing &amp;nbsp;&lt;STRONG&gt;thread t(&amp;amp;CorticaTask::Run, task)&lt;/STRONG&gt; to&lt;STRONG&gt;&amp;nbsp;thread t(&amp;amp;CorticaTask::Run, &amp;amp;task)&amp;nbsp;&lt;/STRONG&gt;makes things run significantly faster (on Sandy), which is understandable, since the task object is no longer copied into the new thread. However, it is still very strange that it runs slower at some working point on a better and newer server.&lt;/P&gt;
&lt;P&gt;Regards, Pavel&lt;/P&gt;
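&lt;P&gt;The change described above, from passing the task object to passing its address, can be illustrated with a minimal sketch. std::thread copies its arguments into storage owned by the new thread, so passing the object itself copies it, while passing its address copies only a pointer. The Task class below is a hypothetical stand-in that counts its own copies; it is not the CorticaTask class from the attached sample, and whether that class owns a large table is an assumption.&lt;/P&gt;

<![CDATA[
```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the task class: owns a table and counts copy constructions.
struct Task {
    static std::atomic<int> copies;
    std::vector<int> lut;
    long long sum = 0;

    explicit Task(std::size_t n) : lut(n, 1) { assert(n > 0); }
    Task(const Task& other) : lut(other.lut), sum(other.sum) { ++copies; }
    Task(Task&& other) noexcept : lut(std::move(other.lut)), sum(other.sum) {}

    // Placeholder work: sum the table.
    void Run() {
        for (int v : lut) sum += v;
    }
};

std::atomic<int> Task::copies{0};

// Starts a thread with the task passed by value and returns how many copies were made.
int copies_when_passed_by_value(std::size_t n) {
    Task task(n);
    Task::copies = 0;
    std::thread t(&Task::Run, task);   // the thread works on its own copy of task
    t.join();
    return Task::copies.load();
}

// Starts a thread with the task passed by address and returns how many copies were made.
int copies_when_passed_by_pointer(std::size_t n) {
    Task task(n);
    Task::copies = 0;
    std::thread t(&Task::Run, &task);  // only the pointer is copied
    t.join();                          // task must outlive the thread
    return Task::copies.load();
}
```
]]>

&lt;P&gt;With the by-value form at least one copy of the table is made per thread start ( the exact number depends on the standard library ), while the pointer form makes none. With the pointer form the caller must keep the task alive until join() returns.&lt;/P&gt;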
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 21 Feb 2013 15:28:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949524#M2122</guid>
      <dc:creator>Pavel_Kogan</dc:creator>
      <dc:date>2013-02-21T15:28:41Z</dc:date>
    </item>
    <item>
      <title>Hello Pavel,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949525#M2123</link>
      <description>&lt;P&gt;Hello Pavel,&lt;/P&gt;
&lt;P&gt;Have you tried adding timing statements inside the Run() routine? This would tell us how much of the runtime variation is due to thread creation overhead versus how much time is spent actually doing the work in the loop.&lt;/P&gt;
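&lt;P&gt;A minimal sketch of such timing statements, assuming std::chrono is available: the work is timed inside the worker function and the whole create, run, join sequence is timed from the caller, so the difference approximates the thread overhead. run_work is a hypothetical stand-in for Run() and is not the code from the attached sample.&lt;/P&gt;

<![CDATA[
```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Timings in milliseconds for one run of the work on a separate thread.
struct Timing {
    double total_ms;  // thread creation + work + join, measured by the caller
    double work_ms;   // work only, measured inside the worker function
};

// Hypothetical stand-in for Run(): copy each row into a buffer and sort it.
// Returns the time spent in the loop, in milliseconds.
double run_work(const std::vector<std::vector<int>>& lut) {
    const auto start = Clock::now();
    std::vector<int> buffer;
    for (const auto& row : lut) {
        buffer.assign(row.begin(), row.end());
        std::sort(buffer.begin(), buffer.end());
    }
    return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
}

// Runs the work on a new thread and reports both the inner and the outer timing.
Timing time_on_separate_thread(const std::vector<std::vector<int>>& lut) {
    assert(!lut.empty());
    double work_ms = 0.0;
    const auto start = Clock::now();
    std::thread t([&lut, &work_ms] { work_ms = run_work(lut); });
    t.join();  // join makes the worker's write to work_ms visible here
    const double total_ms =
        std::chrono::duration<double, std::milli>(Clock::now() - start).count();
    return Timing{total_ms, work_ms};
}
```
]]>

&lt;P&gt;If total_ms minus work_ms stays small while work_ms itself grows on one server, the slowdown is in the work rather than in thread creation. The resolution of the clock depends on the standard library implementation.&lt;/P&gt;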
&lt;P&gt;Pat&lt;/P&gt;</description>
      <pubDate>Thu, 21 Feb 2013 15:51:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949525#M2123</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2013-02-21T15:51:49Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...it still very strange</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949526#M2124</link>
      <description>&amp;gt;&amp;gt;...it still very strange that it is running slower in some working point on better and newer server...

Pavel,

My Dell Precision M4700 with Windows 7 Professional 64-bit OS is &lt;STRONG&gt;highly optimized&lt;/STRONG&gt; for different performance evaluations. That means I turned off as many Windows Services as possible, and when the computer is not connected to the network ( I simply disable the network card ) only 33 Windows Services are running. It makes sense for you to check how many Windows Services are running on your computers. By default, right after a Windows installation is completed, at least 50-60 different Windows Services are running, and that number could be even greater. Please also check the settings of your Anti-Virus software.

If you need a detailed list of my software configuration(s) I could provide it.

&amp;gt;&amp;gt;...how much of the runtime variation is due to thread creation overhead 

Patrick,

Windows creates threads very fast. I don't have an exact number, but it has to be done in a couple of hundred microseconds, or less. Pavel's differences in performance are too big to be explained by that. However, such a verification with the RDTSC instruction will be useful.

My overall conclusion is that something else is wrong and some software or hardware affects performance.

Note: Pavel, did you install all updates for Visual Studio 2012? I did that last weekend...</description>
      <pubDate>Thu, 21 Feb 2013 19:51:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Threads-overhead-Nehalem-vs-Sandy-bridge-vs-Ivy-bridge/m-p/949526#M2124</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-21T19:51:11Z</dc:date>
    </item>
  </channel>
</rss>

