<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic At the start of your program in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082800#M7108</link>
    <description>&lt;P&gt;At the start of your program add&lt;/P&gt;

&lt;P&gt;int nThreads;&lt;BR /&gt;
	#pragma omp parallel&lt;BR /&gt;
	nThreads = omp_get_num_threads();&lt;/P&gt;

&lt;P&gt;The intention is to enter a first parallel region outside of your timed loop, thereby pre-creating the OpenMP thread pool with a full complement of threads. (You will have to expand on this if you use nested parallel regions.) The way your program is structured, each increase in thread count causes unnecessary overhead.&lt;/P&gt;

&lt;P&gt;Please do this and report back your findings.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Thu, 14 Apr 2016 12:46:30 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2016-04-14T12:46:30Z</dc:date>
    <item>
      <title>Performance with thread pooling in OpenMP</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082799#M7107</link>
      <description>&lt;P&gt;I have the following code:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;float arr[1000];
double pre = omp_get_wtime();
for(int j=0; j&amp;lt;1000; ++j)
{
  #pragma omp parallel num_threads(t1)
  {
    #pragma omp for
    for(int i=0; i&amp;lt;1000; ++i) arr[i] = std::pow(i,2);
  }
  #pragma omp parallel num_threads(t2)
  {
    #pragma omp for
    for(int i=0; i&amp;lt;1000; ++i) arr[i] = std::pow(i,2);
  }
}
double post = omp_get_wtime();
double diff = post - pre;
&lt;/PRE&gt;

&lt;P&gt;I get strange timings for different values of t1 and t2 (diff is in seconds):&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;for t1=1, t2=36 diff is 0.070&lt;/LI&gt;
	&lt;LI&gt;for t1=2, t2=36 diff is 1.307&lt;/LI&gt;
	&lt;LI&gt;for t1=8, t2=36 diff is 1.023&lt;/LI&gt;
	&lt;LI&gt;for t1=18, t2=36 diff is 0.690&lt;/LI&gt;
	&lt;LI&gt;for t1=24, t2=36 diff is 0.427&lt;/LI&gt;
	&lt;LI&gt;for t1=36, t2=36 diff is 0.076&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, cores per socket: 18, virtualization: VT-x, sockets: 2, L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 46080K, CentOS 7&lt;/P&gt;

&lt;P&gt;Is there a problem with thread pooling between parallel regions (teams) in OpenMP?&lt;BR /&gt;
	Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Apr 2016 08:42:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082799#M7107</guid>
      <dc:creator>Krzysztof_B_Intel</dc:creator>
      <dc:date>2016-04-11T08:42:40Z</dc:date>
    </item>
    <item>
      <title>At the start of your program</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082800#M7108</link>
      <description>&lt;P&gt;At the start of your program add&lt;/P&gt;

&lt;P&gt;int nThreads;&lt;BR /&gt;
	#pragma omp parallel&lt;BR /&gt;
	nThreads = omp_get_num_threads();&lt;/P&gt;

&lt;P&gt;The intention is to enter a first parallel region outside of your timed loop, thereby pre-creating the OpenMP thread pool with a full complement of threads. (You will have to expand on this if you use nested parallel regions.) The way your program is structured, each increase in thread count causes unnecessary overhead.&lt;/P&gt;
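
&lt;P&gt;A minimal sketch of such a warm-up region, assuming it is wrapped in a helper function. The name warm_up_thread_pool and the fallback for builds without OpenMP are illustrative assumptions, not part of the original program. It uses a single construct so that only one thread writes the result:&lt;/P&gt;

```cpp
#ifdef _OPENMP
#include "omp.h"
#endif

// Hypothetical helper (not from the original program): enters one untimed
// parallel region so the OpenMP runtime can create its worker threads
// before any timing starts.
// Returns the team size of that region, or 1 when built without OpenMP.
int warm_up_thread_pool() {
    int team_size = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        // Only one thread records the team size, so the shared variable
        // is not written by every thread at once.
        #pragma omp single
        team_size = omp_get_num_threads();
    }
#endif
    return team_size;
}
```

&lt;P&gt;Calling warm_up_thread_pool() once before the first omp_get_wtime() call is intended to keep thread creation out of the measured interval. Whether it removes the overhead seen here depends on the runtime.&lt;/P&gt;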

&lt;P&gt;Please do this and report back your findings.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 14 Apr 2016 12:46:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082800#M7108</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-04-14T12:46:30Z</dc:date>
    </item>
    <item>
      <title>Unfortunately, it doesn't</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082801#M7109</link>
      <description>&lt;P&gt;Unfortunately, it doesn't work. Our team is still trying to find a solution.&lt;BR /&gt;
	Thank you for your answer.&lt;/P&gt;

&lt;P&gt;Krzysztof Binias&lt;/P&gt;</description>
      <pubDate>Fri, 15 Apr 2016 20:43:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082801#M7109</guid>
      <dc:creator>Krzysztof_B_Intel</dc:creator>
      <dc:date>2016-04-15T20:43:00Z</dc:date>
    </item>
    <item>
      <title>With the worst case adding</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082802#M7110</link>
      <description>&lt;P&gt;The worst case adding more than one second to the best case leads me to suspect this is a program initialization issue. Such a discrepancy can also occur if your system is heavily loaded. Can you upload the entire test program that exhibits this problem? One possible cause: if you specify an (obscenely) large stack size and your program does something to "first touch" this stack, each new thread instantiation causes an excessive number of page faults (to allocate from the page file, map to VM, possibly wipe), all in competition with other demands on your storage system. This would occur as a once-only symptom. Once OpenMP creates a thread (adds it to a given thread pool), the thread remains available for first and subsequent use. Thereafter, any new "first touch" of your VM would go through the same page-fault hoops.&lt;/P&gt;

&lt;P&gt;Also, as an experimental probe and for additional insight, what happens when you swap t1 and t2 in your num_threads clauses?&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sun, 17 Apr 2016 13:30:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082802#M7110</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-04-17T13:30:42Z</dc:date>
    </item>
    <item>
      <title>&gt; Also, as an experimental</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082803#M7111</link>
      <description>&lt;P&gt;&amp;gt; Also, as an experimental probe and for additional insight, what happens when you swap t1 and t2 in your num_threads clauses?&lt;/P&gt;

&lt;P&gt;The same problem occurs. The test source code is attached to this post.&lt;/P&gt;

&lt;P&gt;Krzysztof Binias&lt;/P&gt;</description>
      <pubDate>Mon, 18 Apr 2016 07:43:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Performance-with-thread-pooling-in-OpenMP/m-p/1082803#M7111</guid>
      <dc:creator>Krzysztof_B_Intel</dc:creator>
      <dc:date>2016-04-18T07:43:59Z</dc:date>
    </item>
  </channel>
</rss>

