<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Yes, obviously we'd all like in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079617#M61067</link>
    <description>&lt;P&gt;Yes, obviously we'd all like there to be no overheads for anything. (I have a very nice fluffy pink unicorn for sale if you are interested.)&lt;/P&gt;

&lt;P&gt;However, overheads exist and can't be completely eliminated. If you are really concerned about the performance of codes that run for 1s, I have to ask you a few questions :-&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;How long does the shell (or Python) script that sets up for the code take to run?&lt;/LI&gt;
	&lt;LI&gt;If you're not using a script, how long does it take to type the command to run the code?&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;Both of those times are also serial overhead which affects the overall time to solution.&lt;/P&gt;

&lt;P&gt;Unfortunately there's not much that we can do in the OpenMP runtime to reduce thread-creation cost, since there is significant serialization inside the kernel.&lt;/P&gt;</description>
    <pubDate>Tue, 08 Nov 2016 14:16:59 GMT</pubDate>
    <dc:creator>James_C_Intel2</dc:creator>
    <dc:date>2016-11-08T14:16:59Z</dc:date>
    <item>
      <title>KNL omp thread sequential startup</title>
      <link>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079615#M61065</link>
      <description>&lt;P&gt;When running a KNL OpenMP job with VTune locksandwaits (sampling interval set to 2), the threads are shown as starting at 3 millisecond intervals. When threads are set by KMP_HW_SUBSET=64c,1t, threads 1 through 62 start up over a period of 200 milliseconds and wait (after a tiny execution interval) until all 62 are available. Even after the threads leave the initial wait state, there appears to be about a 90 millisecond delay from the first to the last worker thread, until they are forced to synchronize at a barrier (so timing across that barrier may show an extra delay of 90 milliseconds).&lt;/P&gt;

&lt;P&gt;While this is a much smaller overhead than was observed on KNC (presumably due largely to KNC needing more threads), it appears to mean that full performance can't be reached for a job that doesn't run for several seconds. I suppose this is to be expected, but some customers are disappointed. Some may like the effect of super-linear scaling for a job whose size increases, as the delay occurs only at the beginning.&lt;/P&gt;
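
&lt;P&gt;Because the delay occurs only at the beginning, one way to keep it out of steady-state numbers is to time thread startup separately from the repeated work. The following is a minimal sketch of that idea using Python's standard thread pool as a stand-in, not OpenMP or VTune; the pool size, the workload, and all function names are hypothetical. In an OpenMP code the analogous step would presumably be an untimed parallel region ahead of the timed section.&lt;/P&gt;

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8  # hypothetical pool size, not a value from this thread


def warm_up(pool, num_workers):
    # Force every worker thread to exist before timing starts: each task
    # blocks at a barrier, so the pool has to start num_workers threads.
    barrier = threading.Barrier(num_workers)

    def arrive(_):
        barrier.wait(timeout=30)
        return threading.get_ident()

    idents = set(pool.map(arrive, range(num_workers)))
    return len(idents)  # number of distinct worker threads that ran


def work(n):
    # Stand-in workload: sum of the first n integers.
    return sum(range(n))


def timed_run(pool, sizes):
    # Time one batch of tasks on the already-created pool.
    start = time.perf_counter()
    results = list(pool.map(work, sizes))
    return results, time.perf_counter() - start


def benchmark(repetitions=3):
    sizes = [10000] * NUM_WORKERS
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        start = time.perf_counter()
        started = warm_up(pool, NUM_WORKERS)
        startup = time.perf_counter() - start  # reported separately
        steady = [timed_run(pool, sizes)[1] for _ in range(repetitions)]
    return started, startup, steady
```

&lt;P&gt;Calling benchmark() returns the number of threads started, the startup time, and a list of per-repetition times that exclude thread creation. Whether excluding the startup time is legitimate depends on what the measurement is meant to represent.&lt;/P&gt;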

&lt;P&gt;VTune locksandwaits takes on the order of 2 seconds before worker thread creation begins, which leaves some doubt about the extent to which VTune itself affects performance. The [advanced-]hotspots analyses seem to have a more reasonable overhead.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2016 13:45:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079615#M61065</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-11-08T13:45:05Z</dc:date>
    </item>
    <item>
      <title>Evidently, for some purposes,</title>
      <link>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079616#M61066</link>
      <description>&lt;P&gt;Evidently, for some purposes, the tactic of putting in a preliminary thread pool warm-up and excluding it from performance reporting (or counting only job repetitions after the first) may be valid.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2016 13:48:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079616#M61066</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-11-08T13:48:50Z</dc:date>
    </item>
    <item>
      <title>Yes, obviously we'd all like</title>
      <link>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079617#M61067</link>
      <description>&lt;P&gt;Yes, obviously we'd all like there to be no overheads for anything. (I have a very nice fluffy pink unicorn for sale if you are interested.)&lt;/P&gt;

&lt;P&gt;However, overheads exist and can't be completely eliminated. If you are really concerned about the performance of codes that run for 1s, I have to ask you a few questions :-&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;How long does the shell (or Python) script that sets up for the code take to run?&lt;/LI&gt;
	&lt;LI&gt;If you're not using a script, how long does it take to type the command to run the code?&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;Both of those times are also serial overhead which affects the overall time to solution.&lt;/P&gt;

&lt;P&gt;Unfortunately there's not much that we can do in the OpenMP runtime to reduce thread-creation cost, since there is significant serialization inside the kernel.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2016 14:16:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079617#M61067</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2016-11-08T14:16:59Z</dc:date>
    </item>
    <item>
      <title>Quote:James C. (Intel) wrote:</title>
      <link>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079618#M61068</link>
      <description>&lt;BLOCKQUOTE&gt;James C. (Intel) wrote:&lt;BR /&gt;

&lt;P&gt;Yes, obviously we'd all like there to be no overheads for anything. (I have a very nice fluffy pink unicorn for sale if you are interested.)&lt;/P&gt;

&lt;P&gt;However, overheads exist and can't be completely eliminated. If you are really concerned about the performance of codes that run for 1s, I have to ask you a few questions :-&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;How long does the shell (or Python) script that sets up for the code take to run?&lt;/LI&gt;
	&lt;LI&gt;If you're not using a script, how long does it take to type the command to run the code?&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;Both of those times are also serial overhead which affects the overall time to solution.&lt;/P&gt;

&lt;P&gt;Unfortunately there's not much that we can do in the OpenMP runtime to reduce thread-creation cost, since there is significant serialization inside the kernel.&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Thanks, I think you're confirming that this OpenMP startup behavior is to be expected, and that VTune is correct in reporting thread creation as serial time.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2016 14:55:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079618#M61068</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-11-08T14:55:09Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;Unfortunately there's not</title>
      <link>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079619#M61069</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;Unfortunately there's not much that we can do in the OpenMP runtime to reduce thread-creation cost&lt;/P&gt;

&lt;P&gt;Possibly you can. If the current implementation has the main thread creating the entire thread pool, you could change this to a binary-tree style of thread pool creation. Any decent O/S implementation should permit a high degree of concurrency within the O/S. For example, while heap allocation may be serialized, memory wipe (if performed) and/or portions of VM mapping can be concurrent. BTW, I think TBB creates its thread pool in this manner.&lt;/P&gt;
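
&lt;P&gt;A minimal sketch of the creation order described above, written in Python purely as an illustration: it is not how the Intel OpenMP runtime or TBB is implemented, and the function names are hypothetical. Thread 0 is the calling thread, and each thread starts up to two children before doing its own work.&lt;/P&gt;

```python
import threading


def start_team_as_tree(num_threads, body):
    # Thread `index` starts children 2*index+1 and 2*index+2 (when those
    # indices exist), so creation fans out over about log2(n) levels
    # instead of one thread creating all the others.
    def node(index):
        children = []
        for child in (2 * index + 1, 2 * index + 2):
            if child in range(num_threads):
                thread = threading.Thread(target=node, args=(child,))
                thread.start()
                children.append(thread)
        body(index)  # this thread's own share of the work
        for thread in children:
            thread.join()  # wait for the whole subtree to finish

    # The calling thread acts as index 0, the root of the tree.
    node(0)


def demo(num_threads=7):
    # Record which thread indices ran, to show that every one was started.
    seen = []
    lock = threading.Lock()

    def body(index):
        with lock:
            seen.append(index)

    start_team_as_tree(num_threads, body)
    return sorted(seen)
```

&lt;P&gt;Calling demo(7) returns the indices 0 through 6, one per thread. The sketch only shows the fan-out pattern; it says nothing about how much kernel-level serialization remains, which is what determines whether the tree actually helps.&lt;/P&gt;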

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2016 16:48:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079619#M61069</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-11-08T16:48:31Z</dc:date>
    </item>
    <item>
      <title>you could change this to a</title>
      <link>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079620#M61070</link>
      <description>&lt;BLOCKQUOTE&gt;
	&lt;P&gt;you could change this to a binary tree type of thread pool creation.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Been there, done that (and with other branching ratios). It didn't give a useful improvement.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2016 16:54:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNL-omp-thread-sequential-startup/m-p/1079620#M61070</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2016-11-08T16:54:29Z</dc:date>
    </item>
  </channel>
</rss>

