<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic  then 5% - 10% faster are a in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160816#M7946</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;&amp;nbsp;then 5% - 10% faster are a significant benefit.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;I have been writing code optimizers for many years and have never seen a customer who was interested in a tool that gives a 5% improvement. I think such a person hasn't been born yet.&lt;/P&gt;&lt;P&gt;In my evaluations I always consider a 5% difference in execution time ( either way: a 5% improvement or a 5% slowdown ) to be nothing more than "noise".&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;For #2, for a properly constructed piece of code it will be known in advance that either:&lt;/P&gt;&lt;P&gt;a) disambiguation isn't necessary (as conflicts cannot occur)&lt;BR /&gt;b) disambiguation is required (as conflicts can occur)&lt;/P&gt;&lt;P&gt;In the case of b), generally a simple test can be performed (e.g. overlapping array sections), and then an alternate form of a) can be selected (or resort to safe reduction methods).&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Leaving aside that such a crude way to perform disambiguation is, in my opinion, unacceptable, you wouldn't even be able to do that: remember that the loop we are discussing is not countable, so the loop count cannot be easily determined and therefore the array regions used by the loop cannot be calculated.&lt;/P&gt;&lt;P&gt;In your case you allocated the memory, thus eliminating the need for disambiguation, but why do you think this represents a real-life situation? In real life, as I know it, routines are often called without any means to recreate the values of their parameters, which makes dynamic memory disambiguation necessary.&lt;/P&gt;</description>
    <pubDate>Tue, 11 Dec 2018 10:02:14 GMT</pubDate>
    <dc:creator>Livshin__David</dc:creator>
    <dc:date>2018-12-11T10:02:14Z</dc:date>
    <item>
      <title>Countable loops in openMP</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160784#M7914</link>
      <description>&lt;P&gt;Some OpenMP-related documents state that, in order for a loop to be handled by OpenMP, it must be “countable”, and they give different definitions of what “countable” means:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;the number of iterations in the loop must be countable with an integer and the loop must use a fixed increment.&lt;/LI&gt;&lt;LI&gt;the loop count can be “determined” ( what does “determined” mean? )&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Is this indeed a requirement of OpenMP? Or is it a requirement of a specific compiler's implementation of OpenMP?&lt;/P&gt;&lt;P&gt;Can the following code ( which doesn't seem to be countable ) be parallelized by OpenMP? ( Note that the question is whether the code itself can be parallelized, not whether there is a way to create a parallel equivalent of the code. )&lt;BR /&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;for ( i = 0; i &amp;lt; cnt; )
{
 x1 = 2.0 * x - 1.;
 if ( x1 &amp;lt; 1.0 )
 {
  i = i + 3;
  x = x*2.;
 }
 else // if ( x1 &amp;gt;= 1. )
 {
  i = i + 2;
  x = x/2.;
 }
}&lt;/PRE&gt;

&lt;P&gt;Thank you,&lt;/P&gt;
&lt;P&gt;David&lt;/P&gt;

&lt;P&gt;&lt;A href="https://community.intel.com/www.dalsoft.com" target="_blank"&gt;www.dalsoft.com&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Nov 2018 10:29:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160784#M7914</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-18T10:29:05Z</dc:date>
    </item>
    <item>
      <title>You would need to make the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160785#M7915</link>
      <description>&lt;P&gt;You would need to make the parallel for run for the maximum required count. Then you could make the body of the loop conditional on i. With static scheduling, this would imply work imbalance, so you could work with schedule(runtime) and try various choices, such as guided, auto, or dynamic, through the OMP_SCHEDULE environment variable. With dynamic, at least, you should try various chunk sizes. The best choices will vary with the number of cores, the total number of iterations, and even which OpenMP library is in use.&lt;/P&gt;</description>
      <pubDate>Sun, 18 Nov 2018 12:45:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160785#M7915</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2018-11-18T12:45:57Z</dc:date>
    </item>
    <item>
      <title>"Countable" generally means</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160786#M7916</link>
      <description>&lt;P&gt;"Countable" generally means that the compiler can generate code that will compute the number of loop iterations without executing the loop.&lt;/P&gt;&lt;P&gt;Modifying the index variable outside of the increment expression in the "for" statement is often prohibited, though special cases can be countable (e.g., a simple unconditional increment of the index variable somewhere in the loop).&amp;nbsp;&lt;/P&gt;&lt;P&gt;In your case, the update(s) of the index variable are conditional, which is usually enough to prevent the loop from being countable.&amp;nbsp; To make it worse, the condition depends on a floating-point value, and that floating-point value is updated within the loop.&amp;nbsp;&amp;nbsp; The number of iterations in such a case may depend on the floating-point rounding mode in effect.&amp;nbsp; Determining the number of iterations in general code is equivalent to solving the Halting Problem, which is not possible. &lt;A href="https://en.wikipedia.org/wiki/Halting_problem" target="_blank"&gt;https://en.wikipedia.org/wiki/Halting_problem&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Nov 2018 19:28:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160786#M7916</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-11-18T19:28:45Z</dc:date>
    </item>
    <item>
      <title>"compiler can generate code</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160787#M7917</link>
      <description>&lt;P&gt;"compiler can generate code that will compute the number of loop iterations without executing the loop"&lt;/P&gt;&lt;P&gt;what kind of code? would it be acceptable to slice the code of the original loop to extract index generation and then loop that code ( and not the original loop )?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Nov 2018 19:57:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160787#M7917</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-18T19:57:49Z</dc:date>
    </item>
    <item>
      <title>That loop is not inherently</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160788#M7918</link>
      <description>&lt;P&gt;That loop is not inherently parallelizable (except in the case where the initial value of x is &amp;lt;=0.0, and in that case replace the loop with x=x*(2**cnt)). Otherwise, x has loop order dependencies.&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 13:55:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160788#M7918</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-19T13:55:00Z</dc:date>
    </item>
    <item>
      <title>Edit: &lt;= 1.0 / (2**cnt)</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160789#M7919</link>
      <description>&lt;P&gt;Edit: &amp;lt;= 1.0 / (2**cnt)&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 16:13:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160789#M7919</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-19T16:13:18Z</dc:date>
    </item>
    <item>
      <title>Although that is not a point</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160790#M7920</link>
      <description>&lt;P&gt;Although it is not the point of my post, in the case you mentioned the result should perhaps be: x*(2**(cnt/3)).&lt;/P&gt;&lt;P&gt;Also, when you say that the loop is not parallelizable, I guess you mean "not parallelizable by OpenMP". I wrote an autoparallelizer ( see &lt;A href="https://community.intel.com/www.dalsoft.com" target="_blank"&gt;www.dalsoft.com&lt;/A&gt; ) that can parallelize this loop ( in fact, a much more complicated loop: the code in my post is a simplified version of that loop ).&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 16:42:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160790#M7920</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-19T16:42:25Z</dc:date>
    </item>
    <item>
      <title>It would help if you give a</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160791#M7921</link>
      <description>&lt;P&gt;It would help if you give a link to the direct page that illustrates how the loop in #1 is auto-parallelized.&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 18:23:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160791#M7921</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-19T18:23:51Z</dc:date>
    </item>
    <item>
      <title>The loop that I referred to (</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160792#M7922</link>
      <description>&lt;P&gt;The loop that I referred to ( of which the code in my post is a simplified version ) is:&lt;/P&gt;&lt;P&gt;for ( i = 0; i &amp;lt; cnt; )&lt;BR /&gt;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; x1 = 2.0 * x - 1.;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; if ( x1 &amp;lt;&amp;nbsp; 1.0 )&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; b[i] = exp( x1 ) * cos( x1 );&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; i = i + 3;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; x = x*2.;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; else&amp;nbsp; // if ( x1 &amp;gt;= 1. )&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; a[i] = sqrt( x1 ) * log( 1 / x1 );&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; i = i + 2;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; x = x/2.;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp; }&lt;/P&gt;&lt;P&gt;I will present the results at the "Compiler, Architecture, And Tools Conference", see&lt;/P&gt;&lt;P&gt;&lt;A href="https://software.intel.com/en-us/event/compiler-conference/2018/schedule" target="_blank"&gt;https://software.intel.com/en-us/event/compiler-conference/2018/schedule&lt;/A&gt;&lt;/P&gt;&lt;P&gt;After the presentation ( December 17 ) I would be glad to show you the code.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 18:50:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160792#M7922</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-19T18:50:53Z</dc:date>
    </item>
    <item>
      <title>Please add this to your</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160793#M7923</link>
      <description>&lt;P&gt;Please add this to your calendar so that those here&amp;nbsp;not attending the conference can see and comment.&lt;/P&gt;&lt;P&gt;Whether or not a compiler can auto-parallelize it, the #9 loop can be parallelized.&lt;/P&gt;&lt;P&gt;In the serial loop,&amp;nbsp;i always increments, and thus will not produce duplicate indices [i]. While it has not been disclosed, it may be a requirement that the initial x be &amp;gt; 0.0. Therefore the values inserted into b or a would not be 0.0.&lt;/P&gt;&lt;P&gt;This untested code may be effective:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;atomic&amp;lt;double&amp;gt; fix_a[cnt], fix_b[cnt];
atomic&amp;lt;int&amp;gt; fill_a,fill_b;
...

// zero the output arrays and the hand-off buffers, one array per section
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      b[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      fix_a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      fix_b[i] = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_a = -1;
fill_b = -1;
... x = some initial value

#pragma omp parallel
{
  // producer: the master thread runs the serial index/x recurrence and
  // publishes each x1 for the consumers below
  #pragma omp master
  {
    for (int i = 0; i &amp;lt; cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 &amp;lt;  1.0 )
      {
        fix_b[i] = x1; // b[i] = exp( x1 ) * cos( x1 );
        fill_b = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 &amp;gt;= 1. )
      {
        fix_a[i] = x1; // a[i] = sqrt( x1 ) * log( 1 / x1 );
        fill_a = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i &amp;lt; cnt; )
    fill_a = cnt;
    fill_b = cnt;
  } // #pragma omp master
  // all threads here (consumers): claim published x1 values with exchange
  int empty_a = 0;
  int empty_b = 0;
  // until done
  for(;empty_a &amp;lt; cnt || empty_b &amp;lt; cnt;)
  {
    while(empty_a &amp;lt;= fill_a &amp;amp;&amp;amp; empty_a &amp;lt; cnt)
    {
      if(fix_a[empty_a] != 0.0)
      {
        // exchange returns non-zero to exactly one thread
        double x1 = fix_a[empty_a].exchange(0.0);
        if(x1)
        {
          a[empty_a] = sqrt( x1 ) * log( 1 / x1 );
        } // if(x1)
      } // if(fix_a[empty_a])
      ++empty_a;
    }
    while(empty_b &amp;lt;= fill_b &amp;amp;&amp;amp; empty_b &amp;lt; cnt)
    {
      if(fix_b[empty_b] != 0.0)
      {
        double x1 = fix_b[empty_b].exchange(0.0);
        if(x1)
        {
          b[empty_b] = exp( x1 ) * cos( x1 );
        } // if(x1)
      } // if(fix_b[empty_b])
      ++empty_b;
    }
  } // for(;empty_a &amp;lt; cnt || empty_b &amp;lt; cnt;)
} // #pragma omp parallel&lt;/PRE&gt;

&lt;P&gt;Depending on your needs, you may want to insert _mm_pause() when waiting for work.&lt;/P&gt;
&lt;P&gt;Keep in mind you may need to modify the code.&lt;/P&gt;
&lt;P&gt;Also, the amount of work needs to be sufficient to amortize the overhead of starting/resuming the thread team. (IOW number of iterations is relatively large).&lt;/P&gt;
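&lt;P&gt;As a rough illustration of that last point, here is a minimal sketch of a break-even check. It assumes a deliberately simple model that is not derived from any measurement in this thread: the work divides evenly across the threads, and starting or resuming the thread team costs a fixed overhead. The function names and parameters are hypothetical.&lt;/P&gt;
<![CDATA[
```cpp
// Hypothetical helper: estimated run time of the parallel version under a
// simple model (work splits evenly across threads, plus a fixed cost for
// starting or resuming the thread team).
double estimated_parallel_seconds(double serial_seconds, int threads,
                                  double team_overhead_seconds) {
    return serial_seconds / threads + team_overhead_seconds;
}

// True when the model predicts that the parallel version beats the serial one.
bool worth_parallelizing(double serial_seconds, int threads,
                         double team_overhead_seconds) {
    return threads > 1 &&
           estimated_parallel_seconds(serial_seconds, threads,
                                      team_overhead_seconds) < serial_seconds;
}
```
]]>
&lt;P&gt;Under this model, 1 second of serial work on 4 threads with 0.25 seconds of team overhead is estimated at 0.5 seconds, so it pays off; a loop whose serial time is smaller than the overhead never does. Real loops will also be affected by load imbalance and memory traffic, which this sketch ignores.&lt;/P&gt;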
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 21:03:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160793#M7923</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-19T21:03:00Z</dc:date>
    </item>
    <item>
      <title>It should be noted that the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160794#M7924</link>
      <description>&lt;P&gt;It should be noted that the wipe of fix_a and fix_b need only be done once due to the pickers resetting to 0.0 with exchange. As to what to do with a and b there is insufficient information in your postings.&lt;/P&gt;&lt;P&gt;You would want to assure that the threads performing the picking were on separate cores (IOW not with multiple threads within a core).&lt;/P&gt;&lt;P&gt;Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 15:22:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160794#M7924</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-20T15:22:49Z</dc:date>
    </item>
    <item>
      <title>As to what to do with a and b</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160795#M7925</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;As to what to do with a and b there is insufficient information in your postings.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;a and b are assumed to be parameters to the routine that contains the loop.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;No, I came up with this in an attempt to show the functionality of the auto-parallelizer, specifically the ability to&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;calculate the loop count ( of a loop that is not countable and thus not supported by OpenMP )&lt;/LI&gt;&lt;LI&gt;resolve memory dependencies for the memory writes to a and b&lt;/LI&gt;&lt;LI&gt;create code to calculate the values needed at the entry to a thread: x and i&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If you would like, I will send you ( after the conference ) my presentation that explains the above.&lt;/P&gt;&lt;P&gt;As to the iteration counts: in the test suite I use, the iteration count cnt is set to 100000000 ( 8 zeros ), which makes me wonder how practical your solution is, where you introduce new arrays of the size cnt.&lt;/P&gt;&lt;P&gt;Also note the use of transcendentals: this is done in order to give some weight to the loop; otherwise the overhead of doing the above would make auto-parallelization not worth it. I ran some tests adding more calls to "expensive" routines and saw how that improves the performance of the parallelized code. You may find the following article helpful to clarify that:&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.dalsoft.com/Calculating_number_of_cores_to_benefit_from_parallelization.pdf" target="_blank"&gt;http://www.dalsoft.com/Calculating_number_of_cores_to_benefit_from_parallelization.pdf&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 16:19:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160795#M7925</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-20T16:19:06Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;which makes me wonder how</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160796#M7926</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;which makes me wonder how practical is your solution where you introduce new arrays of the size cnt.&lt;/P&gt;&lt;P&gt;To reduce the additional arrays to 1 array, it is known that the X1's generated are all &amp;gt; 0.0. Therefore the sign could be used to indicate the path.&lt;/P&gt;&lt;P&gt;As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).&lt;/P&gt;&lt;P&gt;Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this. This said, one (you) could have the compiler identify this type of loop (something similar to a convergence loop), and produce a preamble for single path, then enter the flip/flop for the remainder of the convergence.&lt;/P&gt;&lt;P&gt;Simplified code of my prior post:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;atomic&amp;lt;double&amp;gt; fix_x[cnt];
atomic&amp;lt;int&amp;gt; fill_x;
...
// once only
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      b[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      fix_x[i] = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_x = -1;
... x = some initial value

#pragma omp parallel
{
  // producer: the sign of the stored value encodes the path
  // (negative: path b, positive: path a)
  #pragma omp master
  {
    for (int i = 0; i &amp;lt; cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 &amp;lt;  1.0 )
      {
        fix_x[i] = -x1; // b[i] = exp( x1 ) * cos( x1 );
        fill_x = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 &amp;gt;= 1. )
      {
        fix_x[i] = x1; // a[i] = sqrt( x1 ) * log( 1 / x1 );
        fill_x = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i &amp;lt; cnt; )
    fill_x = cnt;
  } // #pragma omp master
  // all threads here
  int empty_x = 0;
  // until done
  for(;empty_x &amp;lt; cnt;)
  {
    // the empty_x &amp;lt; cnt guard keeps fix_x[cnt] from being read once fill_x == cnt
    while(empty_x &amp;lt;= fill_x &amp;amp;&amp;amp; empty_x &amp;lt; cnt)
    {
      if(fix_x[empty_x] != 0.0)
      {
        double x1 = fix_x[empty_x].exchange(0.0);
        if(x1)
        {
          if(x1 &amp;gt; 0.0)
            a[empty_x] = sqrt( x1 ) * log( 1 / x1 );
          else
            b[empty_x] = exp( -x1 ) * cos( -x1 );
        } // if(x1)
      } // if(fix_x[empty_x])
      ++empty_x;
    }
  } // for(;empty_x &amp;lt; cnt;)
} // #pragma omp parallel&lt;/PRE&gt;

&lt;P&gt;Perhaps you could compare the above with your auto-generated code.&lt;/P&gt;
&lt;P&gt;Note, if the distribution of the modified cells is somewhat random, then this code may be better:&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;atomic&amp;lt;double&amp;gt; fix_x[cnt];
atomic&amp;lt;int&amp;gt; fill_x;
...
// once only
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      b[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      fix_x[i] = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_x = -1;
... x = some initial value

#pragma omp parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  if(iThread == 0)
  {
    // producer: thread 0 runs the serial index/x recurrence
    for (int i = 0; i &amp;lt; cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 &amp;lt;  1.0 )
      {
        fix_x[i] = -x1; // b[i] = exp( x1 ) * cos( x1 );
        fill_x = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 &amp;gt;= 1. )
      {
        fix_x[i] = x1; // a[i] = sqrt( x1 ) * log( 1 / x1 );
        fill_x = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i &amp;lt; cnt; )
    fill_x = cnt;
  } // if(iThread == 0)
  // with more than one thread, thread 0 only produces and the
  // remaining threads are renumbered 0..nThreads-2 as consumers
  if(nThreads &amp;gt; 1)
  {
    --iThread;
    --nThreads;
  }
  if(iThread &amp;gt;=0)
  {
    int empty_x = 0;
    // until done
    for(;empty_x &amp;lt; cnt;)
    {
      // the empty_x &amp;lt; cnt guard keeps fix_x[cnt] from being read once fill_x == cnt
      while(empty_x &amp;lt;= fill_x &amp;amp;&amp;amp; empty_x &amp;lt; cnt)
      {
        // each consumer owns the cells with empty_x % nThreads == iThread
        if(empty_x % nThreads == iThread)
        {
          double x1 = fix_x[empty_x];
          if(x1 != 0.0)
          {
            fix_x[empty_x] = 0.0;
            if(x1 &amp;gt; 0.0)
              a[empty_x] = sqrt( x1 ) * log( 1 / x1 );
            else
              b[empty_x] = exp( -x1 ) * cos( -x1 );
          } // if(x1 != 0.0)
        } // if(empty_x % nThreads == iThread)
        ++empty_x;
      }
    } // for(;empty_x &amp;lt; cnt;)
  }
} // #pragma omp parallel&lt;/PRE&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 17:08:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160796#M7926</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-20T17:08:00Z</dc:date>
    </item>
    <item>
      <title>As to which is faster (your</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160797#M7927</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Your code wouldn't run on my machine because a and b, with cnt as I specified, take all the memory. The rule for writing parallel code is that everything should be done in place, with no huge memory allocations, as there may be no memory available.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;I fail to understand your preoccupation with the value of x. The autoparallelizer is not bothered by that at all. Of course the user should be careful to use values of x that don't cause exceptions, but the same exception will occur in both the sequential and the parallel code. Also, if it were obvious how to parallelize sequential code, I would be out of business.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 17:46:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160797#M7927</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-20T17:46:50Z</dc:date>
    </item>
    <item>
      <title>Ok, then here is simplified</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160798#M7928</link>
      <description>&lt;P&gt;Ok, then here is simplified parallel loop in OpenMP&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#pragma omp parallel
{
&amp;nbsp; int iThread = omp_get_thread_num();
&amp;nbsp; int nThreads = omp_get_num_threads();
&amp;nbsp; int interval = 0;
&amp;nbsp; // all threads perform
&amp;nbsp;&amp;nbsp; for (int i = 0; i &amp;lt; cnt; ++interval)
&amp;nbsp;&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp; x1 = 2.0 * x - 1.; // * is relatively fast
&amp;nbsp;&amp;nbsp;&amp;nbsp; if ( x1 &amp;lt;&amp;nbsp; 1.0 )
&amp;nbsp;&amp;nbsp;&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; if(iterval%nThreads == iThread)
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // distribute computation part
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; b[i] = exp( x1 ) * cos( x1 );
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; i = i + 3;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; x = x*2.;
&amp;nbsp;&amp;nbsp;&amp;nbsp; }
&amp;nbsp;&amp;nbsp;&amp;nbsp; else&amp;nbsp; // if ( x1 &amp;gt;= 1. )
&amp;nbsp;&amp;nbsp;&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; if(interval%nThreads == iThread)
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // distribute computation part
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; a[i] = sqrt( x1 ) * log( 1 / x1 );
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; i = i + 2;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; x = x/2.;
&amp;nbsp;&amp;nbsp;&amp;nbsp; }
&amp;nbsp;&amp;nbsp; }
}&lt;/PRE&gt;
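&lt;P&gt;To make the distribution scheme concrete, here is a small stand-alone sketch (no OpenMP needed) that replays the same control flow and counts how many intervals one thread would own. countOwnedIntervals is a hypothetical helper made up for illustration; it takes x by value, on the assumption that every thread works on its own private copy of x and i.&lt;/P&gt;

```cpp
// Sketch only: countOwnedIntervals is a hypothetical helper for illustration.
// It replays the cheap control flow of the loop above and counts the
// intervals whose compute part would fall to thread iThread.
long long countOwnedIntervals(double x, long long cnt, int nThreads, int iThread)
{
    long long owned = 0;
    long long interval = 0;
    long long i = 0;
    while (cnt > i)                      // keep going while i is below cnt
    {
        double x1 = 2.0 * x - 1.0;
        if (interval % nThreads == iThread)
        {
            ++owned;                     // exp/cos or sqrt/log would run here
        }
        if (x1 >= 1.0)
        {
            i = i + 2;
            x = x / 2.0;
        }
        else
        {
            i = i + 3;
            x = x * 2.0;
        }
        ++interval;                      // every thread advances the same counter
    }
    return owned;
}
```

&lt;P&gt;For example, with x = 3.141592653589793, cnt = 12 and 2 threads, the walk has 5 intervals; thread 0 owns 3 of them and thread 1 owns 2, so every interval is computed exactly once.&lt;/P&gt;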

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 18:25:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160798#M7928</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-20T18:25:00Z</dc:date>
    </item>
    <item>
      <title>To the best of my knowledge,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160799#M7929</link>
      <description>&lt;P&gt;To the best of my knowledge, the loop in your last post is not in canonical form and therefore wouldn't be accepted by OpenMP; gcc should give a compilation error ( when using -fopenmp ) requesting an explicit loop increment.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Nov 2018 06:03:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160799#M7929</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-21T06:03:38Z</dc:date>
    </item>
    <item>
      <title>Please disregard the last</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160800#M7930</link>
      <description>&lt;P&gt;Please disregard the last post - I missed the fact that "#pragma omp" applied to a block ( and not to a loop ).&lt;/P&gt;&lt;P&gt;Sorry.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Nov 2018 19:58:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160800#M7930</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-21T19:58:16Z</dc:date>
    </item>
    <item>
      <title>Here is the code that works,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160801#M7931</link>
      <description>&lt;P&gt;Here is the code that works, however, the compute time of the "DoWork" emulation is rather short.&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;// GoofyLoop.cpp
//

#include "stdafx.h"
#include &amp;lt;iostream&amp;gt;

#include &amp;lt;immintrin.h&amp;gt;
#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt; // malloc
#include &amp;lt;omp.h&amp;gt;

const __int64 N = 100000000;
double* a;
double* b;
const double typicalX = 3.141592653589793;

void Serial(void)
{
	double x = typicalX;
	double x1;
	__int64 cnt = N;
	for (__int64 i = 0; i &amp;lt; cnt;)
	{
		x1 = 2.0 * x - 1.;
		if (x1 &amp;lt;  1.0)
		{
			b[i] = exp(x1) * cos(x1);
			i = i + 3;
			x = x*2.;
		}
		else  // if ( x1 &amp;gt;= 1. )
		{
			a[i] = sqrt(x1) * log(1 / x1);
			i = i + 2;
			x = x / 2.;
		}
	}
}

void Parallel(void)
{
#pragma omp parallel
	{
		int iThread = omp_get_thread_num();
		int nThreads = omp_get_num_threads();
		double x = typicalX;
		double x1;
		__int64 cnt = N;
		__int64 interval = 0;
		for (__int64 i = 0; i &amp;lt; cnt; ++interval)
		{
			x1 = 2.0 * x - 1.;
			if (x1 &amp;lt; 1.0)
			{
				if (interval%nThreads == iThread)
				{
					b[i] = exp(x1) * cos(x1);
				}
				i = i + 3;
				x = x*2.;
			}
			else  // if ( x1 &amp;gt;= 1. )
			{
				if (interval%nThreads == iThread)
				{
					a[i] = sqrt(x1) * log(1 / x1);
				}
				i = i + 2;
				x = x / 2.;
			}
		}
	}
}
int _tmain(int argc, _TCHAR* argv[])
{
	a = (double*)malloc(N * sizeof(double)); // new double[N];
	b = (double*)malloc(N * sizeof(double)); // new double[N];
#pragma omp parallel
	{
#pragma omp master
		{
			std::cout &amp;lt;&amp;lt; "nThreads = " &amp;lt;&amp;lt; omp_get_num_threads() &amp;lt;&amp;lt; std::endl;
		}
	}
#pragma omp parallel for
	for (int i = 0; i &amp;lt; N; ++i)
	{
		a[i] = 0.0;
		b[i] = 0.0;
	}

	for (int rep = 0; rep &amp;lt; 3; ++rep)
	{
		unsigned __int64 t0 = _rdtsc();
		Serial();
		unsigned __int64 t1 = _rdtsc();
		std::cout &amp;lt;&amp;lt; "Serial ticks = " &amp;lt;&amp;lt; t1 - t0 &amp;lt;&amp;lt; std::endl;
	}
	std::cout &amp;lt;&amp;lt; std::endl;
	for (int rep = 0; rep &amp;lt; 3; ++rep)
	{
		unsigned __int64 t0 = _rdtsc();
		Parallel();
		unsigned __int64 t1 = _rdtsc();
		std::cout &amp;lt;&amp;lt; "Parallel ticks = " &amp;lt;&amp;lt; t1 - t0 &amp;lt;&amp;lt; std::endl;
	}
	return 0;
}&lt;/PRE&gt;

&lt;P&gt;On a 4 core w/ HT, Core i7 2700K, running 1 thread per core:&lt;/P&gt;

&lt;PRE class="brush:plain; class-name:dark;"&gt;nThreads = 4
Serial ticks = 3076054566
Serial ticks = 2543030436
Serial ticks = 2547671985

Parallel ticks = 2116263348
Parallel ticks = 2116889788
Parallel ticks = 2128250491&lt;/PRE&gt;

&lt;P&gt;Marginal.&lt;/P&gt;
&lt;P&gt;3 Threads:&lt;/P&gt;

&lt;PRE class="brush:plain; class-name:dark;"&gt;nThreads = 3
Serial ticks = 2585388714
Serial ticks = 2603521131
Serial ticks = 2602569400

Parallel ticks = 2098115233
Parallel ticks = 2108614224
Parallel ticks = 2098695600&lt;/PRE&gt;

&lt;P&gt;Slightly better.&lt;/P&gt;
&lt;P&gt;The "DoWork" section is a relatively small computation between memory writes, so for this example the degree of (productive) parallelization appears to depend on the memory subsystem.&lt;/P&gt;
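&lt;P&gt;To put "marginal" into numbers, here is a small sketch that turns the tick counts above into speedup ratios. Ticks, minOf3 and speedup are hypothetical names made up for illustration, and comparing the best of the three runs is an assumption of this sketch, not something the measurements above prescribe.&lt;/P&gt;

```cpp
// Sketch only: Ticks, minOf3 and speedup are hypothetical names for illustration.
typedef unsigned long long Ticks;

// Smallest of three timings (best-of-three is an assumption of this sketch).
Ticks minOf3(Ticks a, Ticks b, Ticks c)
{
    Ticks m = a;
    if (m > b) m = b;
    if (m > c) m = c;
    return m;
}

// Serial time divided by parallel time; above 1.0 means the parallel run was faster.
double speedup(Ticks serialTicks, Ticks parallelTicks)
{
    return double(serialTicks) / double(parallelTicks);
}

// Best-of-three speedup for the 4-thread run reported above.
double speedup4Threads()
{
    return speedup(minOf3(3076054566ULL, 2543030436ULL, 2547671985ULL),
                   minOf3(2116263348ULL, 2116889788ULL, 2128250491ULL));
}

// Best-of-three speedup for the 3-thread run reported above.
double speedup3Threads()
{
    return speedup(minOf3(2585388714ULL, 2603521131ULL, 2602569400ULL),
                   minOf3(2098115233ULL, 2108614224ULL, 2098695600ULL));
}
```

&lt;P&gt;With these inputs the ratios come out at roughly 1.20 for 4 threads and 1.23 for 3 threads, well short of the ideal 4 and 3.&lt;/P&gt;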
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 22 Nov 2018 03:24:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160801#M7931</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-22T03:24:24Z</dc:date>
    </item>
    <item>
      <title>I created the following test</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160802#M7932</link>
      <description>&lt;P&gt;I created the following test program:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;math.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;   // only needed once the #if 0 block in foo_Jim is enabled

#define N 100000000

double a[N], b[N];

double foo( double *a, double *b, double x, unsigned int cnt )
 {
  double x1;
  unsigned int i;

asm( "#.dco_start" );

  for ( i = 0; i &amp;lt; cnt; i++ )
   {
    x1 = 2.0 * x - 1;
    if ( x1 &amp;lt;  1.0 )
     {
      x1 = exp( x1 ) * cos( x1 );
      b[i] = x1;
      i = i + 3;
      x = x*2.;
     }
    else  // if ( x1 &amp;gt;= 1. )
     {
      x1 = sqrt( x1 ) * log( 1. / x1 );
      a[i] = x1;
      i = i + 2;
      x = x/2.;
     }
   }

asm( "#.dco_end" );

  return x;

 }

// by Jim Dempsey
double foo_Jim( double *a, double *b, double x, unsigned int cnt )
 {
  double x1;
  unsigned int i;

#if 0

#pragma omp parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  int interval = 0;
  // all threads perform
   for ( i = 0; i &amp;lt; cnt; ++interval)
   {
    x1 = 2.0 * x - 1.; // * is relatively fast
    if ( x1 &amp;lt;  1.0 )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        b[i] = exp( x1 ) * cos( x1 );
      }
      i = i + 3;
      x = x*2.;
    }
    else  // if ( x1 &amp;gt;= 1. )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        a[i] = sqrt( x1 ) * log( 1 / x1 );
      }
      i = i + 2;
      x = x/2.;
    }
   }
}

#endif

return x;

}


int main()
 {
  double rslt, rslt1;
  unsigned int i;

  for ( i = 0; i &amp;lt; N; i++ )
   {
    a[i] = 0.;
    b[i] = 0.;
   }

  for( i = 0; i &amp;lt; 10; i++ )
   {
    rslt = foo( a, b, 3.1415, N - 100 );
//    rslt = foo_Jim( a, b, 3.1415, N - 100 );
   }

  rslt1 = 0.;
  for ( i = 0; i &amp;lt; N; i++ )
   {
    rslt1 += a[i];
    rslt1 += b[i];
   }

  printf( "rslt  %f   %f\n", rslt, rslt1 );

 }&lt;/PRE&gt;
&lt;P&gt;and generated 3 executables ( altering code as necessary ):&lt;/P&gt;&lt;P&gt;serial code: loop&lt;/P&gt;&lt;P&gt;parallel code generated by my auto-parallelizer: loop_dco&lt;/P&gt;&lt;P&gt;your code: loop_Jim&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Each program was executed 3 times on an E8600 @ 3.33GHz, 2-core Linux machine. Each run was under the "time" command ( e.g. "time ./loop" ), and the reported time ( see below ) is the 'real' time of the run that was neither the fastest nor the slowest of the 3 executions attempted.&lt;/P&gt;&lt;P&gt;The execution times ( in seconds ) are:&lt;/P&gt;&lt;P&gt;loop: 14.99&lt;/P&gt;&lt;P&gt;loop_dco: 11.75&lt;/P&gt;&lt;P&gt;loop_Jim: 10.48&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As you can see, the program generates and prints checksums - for loop and loop_dco these checksums always ( for every invocation ) fully agreed, while loop_Jim was generating different checksums for every separate run (?).&lt;/P&gt;&lt;P&gt;A few words about the code:&lt;/P&gt;&lt;P&gt;loop_Jim assumes that a and b are non-overlapping memory regions and therefore may be used in parallel code; loop_dco doesn't make such an assumption and generates code to verify that dynamically at run time - overhead of up to 20%.&lt;/P&gt;&lt;P&gt;loop_Jim doesn't preserve the value of x that shall be returned.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you like, I would be glad to send you my ( Linux ) executables.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Nov 2018 13:01:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160802#M7932</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-22T13:01:39Z</dc:date>
    </item>
    <item>
      <title>Knowing this is now memory</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160803#M7933</link>
      <description>&lt;P&gt;Knowing that this is now memory-access bound, performing aligned allocation and organizing stores by cache line yields a little more improvement:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;const int CacheLineSize = 64;
const int doublesInCacheLine = CacheLineSize / sizeof(double);
...
	a = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double[N];
	b = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double[N];
...
				if ((i / doublesInCacheLine) % nThreads == iThread)
...
				if ((i / doublesInCacheLine) % nThreads == iThread)
&lt;/PRE&gt;

&lt;PRE class="brush:plain; class-name:dark;"&gt;nThreads = 4
Serial ticks = 2665960090
Serial ticks = 2645705503
Serial ticks = 2551374815

Parallel ticks = 1949592124
Parallel ticks = 1957116967
Parallel ticks = 2024346334&lt;/PRE&gt;
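&lt;P&gt;The ownership test used in both branches can be read on its own. The sketch below wraps it in a hypothetical helper, ownsElement, made up for illustration. It assumes the arrays start on a 64-byte boundary (as with the _mm_malloc calls above) and that a double is 8 bytes, so that elements 0..7 share one cache line, 8..15 the next, and so on.&lt;/P&gt;

```cpp
const int CacheLineSize = 64;
const int doublesInCacheLine = CacheLineSize / sizeof(double);

// Sketch only: ownsElement is a hypothetical helper for illustration.
// Cache lines are dealt out round-robin, so all stores that land in one
// line come from the same thread.
bool ownsElement(long long i, int nThreads, int iThread)
{
    return (i / doublesInCacheLine) % nThreads == iThread;
}
```

&lt;P&gt;With 4 threads, thread 0 would own elements 0..7 and 32..39, thread 1 elements 8..15, and so on, presumably so that two threads never write into the same line.&lt;/P&gt;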

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 22 Nov 2018 15:12:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160803#M7933</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-22T15:12:50Z</dc:date>
    </item>
  </channel>
</rss>

