<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic It was indeed the problem. I in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958997#M21421</link>
    <description>&lt;P&gt;It was indeed the problem. I thank you for your time and apologize for the inconvenience.&lt;/P&gt;</description>
    <pubDate>Fri, 18 Oct 2013 10:42:44 GMT</pubDate>
    <dc:creator>Jean-Philippe_H_</dc:creator>
    <dc:date>2013-10-18T10:42:44Z</dc:date>
    <item>
      <title>Scalability issue with fully parallel code</title>
      <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958995#M21419</link>
      <description>&lt;P&gt;Hi there!&lt;/P&gt;
&lt;P&gt;I've been running some experiments on Xeon Phi recently and I'm hitting a serious scalability issue. The idea is to run fully parallel code with no memory access on a growing number of processors, each executing exactly the same number of operations. The workload per processor is fixed, so we should expect a constant execution time. Here is the code we are running:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;int main(int argc, char *argv[])&lt;BR /&gt;{&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; int i = 0; struct timespec start, end; uint64_t total; FILE *fd = stderr;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&amp;nbsp;&amp;nbsp; // Make sure the OpenMP threads are started before we start our time calculation&lt;/STRONG&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp; #pragma omp parallel&lt;BR /&gt;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; fprintf (stdout, "parallel %d\n", omp_get_thread_num ());&lt;BR /&gt;&amp;nbsp;&amp;nbsp; }&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&amp;nbsp;&amp;nbsp; // Start the experiment&lt;/STRONG&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp; clock_gettime (CLOCK_REALTIME, &amp;amp;start);&lt;BR /&gt;&amp;nbsp;&amp;nbsp; #pragma omp parallel&lt;BR /&gt;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for ( i = 0 ; i &amp;lt; 1024*1024*128 ; i++ )&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; asm volatile(&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "addl $1,%%eax;\n\t"&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; :&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; :&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; :"%eax"&lt;BR /&gt;&amp;nbsp;&amp;nbsp; );&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&lt;STRONG&gt;&amp;nbsp;&amp;nbsp; // Stop the experiment&lt;/STRONG&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp; clock_gettime (CLOCK_REALTIME, &amp;amp;end);&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp; total = (end.tv_sec * 1e9 + end.tv_nsec)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; - (start.tv_sec * 1e9 + start.tv_nsec);&lt;BR 
/&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp; fprintf (fd, "time = %llu\n", (unsigned long long) total);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; return 0;&lt;BR /&gt;}&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Here is some info about our environment&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;MPSS Version:mpss_gold_update_3-2.1.6720-19 &amp;nbsp; (released: &amp;nbsp; September 10 2013)&lt;/LI&gt;
&lt;LI&gt;KMP_AFFINITY=scatter: this avoids possible hardware stalls due to Hyper-Threading contention. We want our code to be fully parallel.&lt;/LI&gt;
&lt;LI&gt;Number of OpenMP threads set via the OMP_NUM_THREADS environment variable&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Here are the results we got&lt;/STRONG&gt;&lt;BR /&gt;For 4 cores, avg = 21.09s (25 runs)&lt;BR /&gt;For 8 cores, avg = 43.29s (25 runs)&lt;BR /&gt;For 12 cores, avg = 65.37s (25 runs)&lt;BR /&gt;For 16 cores, avg = 87.24s (25 runs)&lt;BR /&gt;For 20 cores, avg = 109.95s (25 runs)&lt;BR /&gt;For 24 cores, avg = 132.18s (25 runs)&lt;BR /&gt;For 28 cores, avg = 152.79s (25 runs)&lt;BR /&gt;For 32 cores, avg = 175.32s (24 runs)&lt;BR /&gt;For 36 cores, avg = 196.47s (24 runs)&lt;BR /&gt;For 40 cores, avg = 218.72s (24 runs)&lt;BR /&gt;For 44 cores, avg = 241.10s (24 runs)&lt;BR /&gt;For 48 cores, avg = 263.49s (24 runs)&lt;BR /&gt;For 52 cores, avg = 285.33s (24 runs)&lt;BR /&gt;For 56 cores, avg = 307.35s (24 runs)&lt;/P&gt;
&lt;P&gt;The execution time clearly grows with the number of cores instead of staying constant. So my question is: do these numbers look normal to you?&lt;/P&gt;
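<![CDATA[<P>To put a number on the trend in the results above, here is a small sketch that divides each average time by its thread count. It assumes that the "cores" figure is the OMP_NUM_THREADS value; the helper name seconds_per_thread is made up for this illustration.</P>

```c
#include <assert.h>
#include <stddef.h>

/* Thread counts and average times in seconds, copied from the results above. */
static const int threads[] = {
    4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56
};
static const double seconds[] = {
    21.09, 43.29, 65.37, 87.24, 109.95, 132.18, 152.79,
    175.32, 196.47, 218.72, 241.10, 263.49, 285.33, 307.35
};

/* Hypothetical helper: average run time divided by thread count for entry k. */
double seconds_per_thread(size_t k)
{
    assert(k < sizeof threads / sizeof threads[0]);
    return seconds[k] / threads[k];
}
```

<P>Every entry comes out at roughly 5.3 to 5.5 seconds per thread, so the run time grows almost exactly linearly with the thread count.</P>]]>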
&lt;P&gt;Jp&lt;/P&gt;</description>
      <pubDate>Fri, 18 Oct 2013 09:42:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958995#M21419</guid>
      <dc:creator>Jean-Philippe_H_</dc:creator>
      <dc:date>2013-10-18T09:42:22Z</dc:date>
    </item>
    <item>
      <title>I think you should declare</title>
      <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958996#M21420</link>
      <description>&lt;P&gt;I think you should declare the variable "i" inside the parallel region. As your code stands, "i" is shared between all the threads, which is not what you want...&lt;/P&gt;</description>
      <pubDate>Fri, 18 Oct 2013 10:05:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958996#M21420</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2013-10-18T10:05:20Z</dc:date>
    </item>
    <item>
      <title>It was indeed the problem. I</title>
      <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958997#M21421</link>
      <description>&lt;P&gt;It was indeed the problem. I thank you for your time and apologize for the inconvenience.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Oct 2013 10:42:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958997#M21421</guid>
      <dc:creator>Jean-Philippe_H_</dc:creator>
      <dc:date>2013-10-18T10:42:44Z</dc:date>
    </item>
    <item>
      <title>No problem at all. I'm glad</title>
      <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958998#M21422</link>
      <description>&lt;P&gt;No problem at all. I'm glad the fix was that simple!&lt;/P&gt;</description>
      <pubDate>Fri, 18 Oct 2013 10:56:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958998#M21422</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2013-10-18T10:56:57Z</dc:date>
    </item>
    <item>
      <title>I believe that if the pragma</title>
      <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958999#M21423</link>
      <description>&lt;P&gt;I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private.&lt;/P&gt;
&lt;P&gt;This mistake is common and it is unfortunate that the OpenMP syntax allows it.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Oct 2013 18:48:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/958999#M21423</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2013-10-18T18:48:12Z</dc:date>
    </item>
    <item>
      <title>If discussing "unfortunate"</title>
      <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/959000#M21424</link>
      <description>&lt;P&gt;If discussing "unfortunate" features, Cilk(tm) Plus allows a default shared cilk_for loop index for a .c file but not for a .cpp source file.&amp;nbsp; This is probably considered too obvious to document, but still a point on which mistakes are easily made.&lt;/P&gt;</description>
      <pubDate>Sat, 19 Oct 2013 00:20:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/959000#M21424</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-10-19T00:20:25Z</dc:date>
    </item>
    <item>
      <title>"I believe that if the pragma</title>
      <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/959001#M21425</link>
      <description>&lt;P&gt;"I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private."&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Indeed true, however the semantics would also have been completely different! (Sharing a fixed amount of work between the threads as against doing all the work in each thread).&lt;/P&gt;
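<![CDATA[<P>A minimal sketch of that difference. The function names replicated_work and shared_work are hypothetical and not taken from the code in this thread, and a counter with a reduction stands in for the real work. Build with OpenMP enabled (for example -fopenmp with GCC); without it the pragmas are ignored and both functions simply run on one thread.</P>

```c
#include <assert.h>

/* "omp parallel": every thread executes the whole loop, so the total amount
   of work grows with the thread count. The index is declared in the for
   statement, so each thread has its own copy. */
long replicated_work(long n)
{
    long total = 0;
    assert(n >= 0);
    #pragma omp parallel reduction(+:total)
    {
        for (long i = 0; i < n; i++)
            total += 1;
    }
    return total;   /* n multiplied by the number of threads */
}

/* "omp parallel for": the n iterations are divided among the threads, so the
   total amount of work stays fixed. */
long shared_work(long n)
{
    long total = 0;
    assert(n >= 0);
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++)
        total += 1;
    return total;   /* n, whatever the number of threads */
}
```

<P>The benchmark in this thread needs the first form, because each thread is meant to do the full amount of work.</P>]]>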
&lt;P&gt;My personal preference (even in non-OpenMP code) is to declare and initialise variables in C/C++ where they are first required unless they need a wider scope. That normally avoids the need to specify that they are "private" if/when you add OpenMP directives.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 09:26:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/959001#M21425</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2013-10-21T09:26:38Z</dc:date>
    </item>
    <item>
      <title>Quote:John D. McCalpin wrote:</title>
      <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/959002#M21426</link>
      <description>&lt;BLOCKQUOTE&gt;John D. McCalpin wrote:&lt;BR /&gt;
&lt;P&gt;I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private.&lt;/P&gt;
&lt;P&gt;This mistake is common and it is unfortunate that the OpenMP syntax allows it.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;This is exactly what happened in this case. I use the "parallel for" directive a lot, and I automatically assumed the induction variable would be private here as well. I won't fall into this trap twice! Again, I am deeply sorry for this unfortunate, simple mistake.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 11:02:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/959002#M21426</guid>
      <dc:creator>Jean-Philippe_H_</dc:creator>
      <dc:date>2013-10-21T11:02:36Z</dc:date>
    </item>
    <item>
      <title>Another reason for using for</title>
      <link>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/959003#M21427</link>
      <description>&lt;P&gt;Another reason for using for(int i=... is that then the compiler optimization code will then know that "i" exits scope after the for statement. Meaning better opertunities for optimization (registerization) of i.&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 12:31:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Scalability-issue-with-fully-parallel-code/m-p/959003#M21427</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2013-10-21T12:31:52Z</dc:date>
    </item>
  </channel>
</rss>

