<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: openmp slower than single threaded in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861898#M2451</link>
    <description>&lt;P&gt;More on rand():&lt;BR /&gt;&lt;A title="Use of rand() in OpenMP parallel sections" href="http://www.viva64.com/blog/en/2009/05/27/98/" target="_blank"&gt;Use of rand() in OpenMP parallel sections&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 10 Jul 2009 10:13:50 GMT</pubDate>
    <dc:creator>AndreyKarpov</dc:creator>
    <dc:date>2009-07-10T10:13:50Z</dc:date>
    <item>
      <title>openmp slower than single threaded</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861894#M2447</link>
      <description>a program for the sole purpose of trying to demonstrate the advantage of using 4 cores simultaneously is below.&lt;BR /&gt;&lt;BR /&gt;however, it runs for 90 seconds on a 4 core xeon (3ghz) versus 2 seconds on a single core machine.&lt;BR /&gt;&lt;BR /&gt;any hints greatly appreciated.&lt;BR /&gt;&lt;BR /&gt;Tom&lt;BR /&gt;&lt;BR /&gt;#include &amp;lt;stdio.h&amp;gt;&lt;BR /&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;BR /&gt;#include &amp;lt;omp.h&amp;gt;&lt;BR /&gt;#include &amp;lt;time.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;#define N     1000&lt;BR /&gt;#define CHUNKSIZE 25&lt;BR /&gt;&lt;BR /&gt;main () {&lt;BR /&gt; time_t sec1;&lt;BR /&gt; time_t sec2;&lt;BR /&gt; sec1 = time(NULL);&lt;BR /&gt; printf("start \n");&lt;BR /&gt;&lt;BR /&gt; int i, chunk;&lt;BR /&gt; float a[N];&lt;BR /&gt; float b[N];&lt;BR /&gt; float c[N];&lt;BR /&gt; int j;&lt;BR /&gt; float k;&lt;BR /&gt;&lt;BR /&gt; for (i=0; i &amp;lt; N; i++)&lt;BR /&gt; a[i] = b[i] = i * 1.0;&lt;BR /&gt; chunk = CHUNKSIZE;&lt;BR /&gt;&lt;BR /&gt; #pragma omp parallel for private(i,j,k) schedule(static,chunk)&lt;BR /&gt; for (i=0; i &amp;lt; N; i++) {&lt;BR /&gt; for (j = 0; j&amp;lt;200000; j++) {&lt;BR /&gt; k = rand();&lt;BR /&gt; }&lt;BR /&gt;//    c[i] = a[i] + b[i];&lt;BR /&gt; }&lt;BR /&gt;&lt;BR /&gt; sec2 = time(NULL) - sec1;&lt;BR /&gt; printf("%ld seconds", sec2);&lt;BR /&gt; return 0;&lt;BR /&gt;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;compiled using 'gcc -O3 -fopenmp workshare2.c -o workshare2' on gcc 4.3.2 on opensuse64 11.1&lt;BR /&gt;</description>
      <pubDate>Tue, 07 Jul 2009 20:59:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861894#M2447</guid>
      <dc:creator>inttel</dc:creator>
      <dc:date>2009-07-07T20:59:57Z</dc:date>
    </item>
    <item>
      <title>Re: openmp slower than single threaded</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861895#M2448</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="margin-top: 5px; width: 100%;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/434775"&gt;inttel&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;a program for the sole purpose of trying to demonstrate the advantage of using 4 cores simultaneously is below.&lt;BR /&gt;however, it runs for 90 seconds on a 4 core xeon (3ghz) versus 2 seconds on a single core machine.&lt;BR /&gt;any hints greatly appreciated.&lt;BR /&gt;&lt;BR /&gt;[code section excised for sanity]&lt;BR /&gt;&lt;BR /&gt;compiled using 'gcc -O3 -fopenmp workshare2.c -o workshare2' on gcc 4.3.2 on opensuse64 11.1&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;The core of your problem is probably here:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;[cpp]#pragma omp parallel for private(i,j,k) schedule (static,chunk) 
   for (i=0; i &amp;lt; N; i++) {
      for (j = 0; j&amp;lt;200000; j++) {
         k = rand(); 
      } 
      // c[i] = a[i] + b[i];
   }[/cpp]&lt;/PRE&gt;
&lt;P&gt;Though rand() is not required to be reentrant, and therefore not required to be thread-safe (see &lt;A href="http://www.opengroup.org/onlinepubs/000095399/functions/rand.html"&gt;http://www.opengroup.org/onlinepubs/000095399/functions/rand.html&lt;/A&gt;), the fact is that some implementations provide thread safety by putting a lock in the function, which probably means that all those parallel invocations of rand() from the various threads are being serialized. That could go a long way toward explaining the slowdown you report.&lt;BR /&gt;&lt;BR /&gt;For future reference, you might consider timing just the code you're testing for parallel performance, rather than including the serial initialization section in the timed section, as this example does.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jul 2009 22:03:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861895#M2448</guid>
      <dc:creator>robert-reed</dc:creator>
      <dc:date>2009-07-07T22:03:19Z</dc:date>
    </item>
    <item>
      <title>Re: openmp slower than single threaded</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861896#M2449</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/336004"&gt;Robert Reed (Intel)&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;P&gt;the fact is that some implementations provide thread safety by putting a lock in the function&lt;/P&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Sane implementations (e.g., Microsoft Visual C++) provide thread safety by placing all the data in thread-local storage (TLS). This is slightly sub-optimal, but it scales perfectly.&lt;BR /&gt;You may consider using a well-designed, self-contained random generator (like the one in Boost), so that you can create one generator per thread on the stack.&lt;BR /&gt;</description>
      <pubDate>Thu, 09 Jul 2009 12:49:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861896#M2449</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2009-07-09T12:49:01Z</dc:date>
    </item>
    <item>
      <title>Re: openmp slower than single threaded</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861897#M2450</link>
      <description>There is another possible problem: 25 rand() calls per task may be too little work to outweigh the parallelization overhead. With current tools, the work per task should be at least some 10,000 machine cycles.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 09 Jul 2009 12:53:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861897#M2450</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2009-07-09T12:53:20Z</dc:date>
    </item>
    <item>
      <title>Re: openmp slower than single threaded</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861898#M2451</link>
      <description>&lt;P&gt;More on rand():&lt;BR /&gt;&lt;A title="Use of rand() in OpenMP parallel sections" href="http://www.viva64.com/blog/en/2009/05/27/98/" target="_blank"&gt;Use of rand() in OpenMP parallel sections&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 10 Jul 2009 10:13:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861898#M2451</guid>
      <dc:creator>AndreyKarpov</dc:creator>
      <dc:date>2009-07-10T10:13:50Z</dc:date>
    </item>
    <item>
      <title>Re: openmp slower than single threaded</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861899#M2452</link>
      <description>&lt;DIV style="margin: 0px; height: auto;"&gt;&lt;/DIV&gt;
Ok, I'm a noob at openmp but maybe specifying chunksize = 25 creates too many threads that choke your 4 cores. Try to only create a maximum of 2 * number of cores threads.&lt;BR /&gt;</description>
      <pubDate>Tue, 14 Jul 2009 14:38:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861899#M2452</guid>
      <dc:creator>Tudor</dc:creator>
      <dc:date>2009-07-14T14:38:02Z</dc:date>
    </item>
    <item>
      <title>Re: openmp slower than single threaded</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861900#M2453</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/435330"&gt;Tudor Serban&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; Ok, I'm a noob at openmp but maybe specifying chunksize = 25 creates too many threads that choke your 4 cores. Try to only create a maximum of 2 * number of cores threads.&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
No, setting the chunk size doesn't affect the number of threads. However, you touch on a good point. Normally, with balanced work among chunks, the largest possible chunk size will be superior, at least when using static scheduling with affinity set.&lt;BR /&gt;In this case, it's not at all clear what the original poster was getting at. It's certainly not a normal usage of OpenMP. Maybe he wanted to see whether OpenMP inhibits the compiler from eliminating redundant loops.&lt;BR /&gt;</description>
      <pubDate>Tue, 14 Jul 2009 15:07:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861900#M2453</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2009-07-14T15:07:08Z</dc:date>
    </item>
    <item>
      <title>Re: openmp slower than single threaded</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861901#M2454</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;&lt;EM&gt;Normally, with balanced work among chunks, the largest possible chunk size will be superior&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;Only when all cores/HW threads are dedicated to running your app. When anything else is running on the system, a smaller chunk size &lt;EM&gt;may&lt;/EM&gt; be superior. The situation is similar with nested levels and/or when using NOWAIT.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
      <pubDate>Tue, 14 Jul 2009 18:12:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/openmp-slower-than-single-threaded/m-p/861901#M2454</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-07-14T18:12:42Z</dc:date>
    </item>
  </channel>
</rss>

