<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic David, in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160820#M7950</link>
    <description>&lt;P&gt;David,&lt;/P&gt;&lt;P&gt;What you have done for auto-parallelism is quite interesting. And once mature (bugs shaken out), very impressive. As to whether this technology can be incorporated into existing compilers (IP acquired) or sold as a stand-alone post-processor, I have some prior experience on this. I suspect that either nothing will happen or you might get some interest in IP acquisition. Selling or consulting for post-processing may get but a few interested parties.&lt;/P&gt;&lt;P&gt;Back in the stone age (1990-1992) my minicomputer business faded away due to PCs disrupting the minicomputer business. Back in those days, a large portion of the market still ran MS-DOS in Real Mode or using a DOS Extender in a transitional Real/Protected mode. Windows (Protected Mode) was still coming of age. Looking to form a new PC based software company, I'd asked myself what I could offer.&lt;/P&gt;&lt;P&gt;The first product was to port the Digital Equipment Corporation text editor TECO to the PC. This text editor is more capable than AWK, Brief and others,&amp;nbsp;at pattern matching and MUNGing up text. Mung is a self-referencing acronym: Mung Until No Good. Due to the steep learning curve (too many features)&amp;nbsp;there was little interest in this product.&lt;/P&gt;&lt;P&gt;Then I thought, well, how can I get interest in this... ah... incorporate it in a tool for programmers. At that time I had gained considerable programming experience using Borland Turbo-C, and then&amp;nbsp;Borland C++. During debugging, I'd often looked at the disassembly code and noticed that although the Borland compilers produced optimized code, in places the code was non-optimal. 
I guess it was good enough for Borland as their compiler was better than Microsoft's.&lt;/P&gt;&lt;P&gt;I thought to myself, hmm..., if I used the Borland compiler to produce assembler output, I could use my TECO editor to search out the non-optimal code sections and tidy them up. While the code patterns tended to be the same, the register assignments and branch labels would differ. TECO is capable of fuzzy match so I could search out a fuzzy large pattern (multiple lines) containing smaller fuzzy match patterns (registers and labels), then massage the fuzzy stuff and produce optimized replacement code. I called this the PeepHole Optimizer.&lt;/P&gt;&lt;P&gt;The above procedure was performed again for different non-optimal code sequences. After I thought I had found all, or&amp;nbsp;enough of, the low-hanging fruit, I re-examined my output files only to discover that my tweaked-up code produced some sections that could benefit from further optimizations using the same fuzzy match and rewrite macros. IOW, I could pass my output back through the same editing macros to yield further improvements (and add new macros to handle exception cases). A typical example of MUNG (without the NG).&lt;/P&gt;&lt;P&gt;The end process produced code that was typically 15% faster. I thought this would sell as a post-processor. I found little interest (maybe 100 sales). Some of my users asked if this would work on MS C/C++. So I gave it a try and to my surprise, although the assembler code looked quite different between Borland and Microsoft, the fuzzy pattern matching found and corrected much of the code produced by the MS compiler and yielded similar performance improvements.&lt;/P&gt;&lt;P&gt;At this time the Intel 80386 CPU&amp;nbsp;was quite popular. You had 16-bit MS-DOS, MS-DOS with DOS Extender (protected mode) and Windows (protected mode). IOW we had the MS-DOS/DOS-Extender market and the Windows market. 
MS-DOS could live in the low 1MB of RAM but was capable of using the additional RAM as a RAM-DISK via a driver. The DOS extenders could use the extra memory but only in 16-bit selector model. The CPU though was capable of 32-bit Flat Model mode but DOS could not run in that mode. Some of us programmers figured out that the CPU DS and ES selectors could be pre-conditioned to have Huge Granularity bit set permitting addressing&amp;nbsp;indexed/based by 32-bit registers. The only way to perform the indexing was by modifying assembly code to instruct the assembler to emit the 32-bit address prefix byte. Borland TASM assembler would do this. At that time users could write code in assembler that would run 16-bit MS-DOS code with 32-bit addressing capability for data. The only problem was this required assembly coding.&lt;/P&gt;&lt;P&gt;Now then, I thought, I can optimize the assembly output of Borland and Microsoft C/C++ code by way of the assembler output, so with enhancement to the PeepHole optimizer I could add Flat Model programming. Side bar: at that time a 16-bit program (C/C++) could have a pointer, a&amp;nbsp;16-bit address off of the DS (Data Segment), a&amp;nbsp;"far&amp;nbsp;pointer" two 16-bit words one to be loaded into the ES (Extra Segment) register and the other to be&amp;nbsp;used as a 16-bit index (from this pointer you could byte offset index +/- 32KB), and finally a "huge pointer", which&amp;nbsp;has the two 16-bit words like the far pointer, but the compiler could generate code that the index would manipulate both the segment and offset thus permitting arrays larger than 32KB (but still confined within the 1MB box).&lt;/P&gt;&lt;P&gt;What I did was to pattern match for huge pointer manipulation, and convert this to 32-bit Flat Model, inclusive of using the CPU's SIB (Scale Index Base) capability. This made for a vast reduction in code that manipulated huge pointers and permitted 32-bit indexing. 
The next problem though was the C Runtime Library heap manager was locked in the 1MB box (640KB or 720KB)&amp;nbsp;as presented by MS-DOS. Because I could manipulate huge pointers, the only remaining issue was with the allocation. While it was an easy process to locate the malloc/free, identifying if the pointer receiving the malloc or supplied to free was a huge pointer or far pointer was problematic. I knew the debugger had this information, so I required the C/C++ compiler to add the debug symbol table to the assembler output, and added text macro code to search the debug symbol table to find out if the pointer was attributed as huge. If so, the allocation/free was redirected to my own memory manager whose heap could expand to all of physical RAM.&lt;/P&gt;&lt;P&gt;Now I had a product that could run in MS-DOS whose code, though restricted to the low 640KB, had access to all of physical RAM as data (4MB, 8MB, 16MB, ...) and executed 2x to 4x faster. Great, or so I thought. There were two marketing problems with this:&lt;/P&gt;&lt;P&gt;a) I was a little company and few businesses and/or research institutes would deal with a small outfit (non-major vendor)&lt;BR /&gt;b) The product couldn't work inside Windows&lt;/P&gt;&lt;P&gt;The problem in Windows was a result of a poor decision on Intel's part when they implemented the Virtual 8086 Machine mode (V86) to run MS-DOS applications (this is a quasi Real Mode). While the memory management hardware supports a process being mapped (via selectors) and, via the Page Tables, placed on arbitrary pages, someone at Intel thought it wise that any index used that resulted in an index beyond 1MB from the base of the Selector (in V86 mode) was to generate a GP fault... regardless of whether that selector contained a size larger than 1MB. 
IMHO there was absolutely no reason for this seeing that for a typical MS-DOS session the Selector could be allocated with a size of 1MB and that any attempt to address outside of the box would generate a Page Fault. As such my atypical MS-DOS session, while it could manipulate the granularity and potentially acquire the additional RAM for the Selector, was prohibited from addressing that memory.&lt;/P&gt;&lt;P&gt;What this long story is about is a cautionary tale that sometimes superior products fail.&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Thu, 20 Dec 2018 16:37:26 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2018-12-20T16:37:26Z</dc:date>
    <item>
      <title>Countable loops in openMP</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160784#M7914</link>
      <description>&lt;P&gt;Some OpenMP related documents state that in order for a loop to be handled by OpenMP it must be “countable”, and they give different definitions of what makes a loop “countable”:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;the number of iterations in the loop must be countable with an integer and the loop must use a fixed increment.&lt;/LI&gt;&lt;LI&gt;the loop count can be “determined” ( what does “determined” mean? )&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Is this indeed a requirement of OpenMP? Or is it a requirement of a specific compiler's implementation of OpenMP?&lt;/P&gt;&lt;P&gt;Can the following code ( which doesn't seem to be countable ) be parallelized by OpenMP ( note that the question is whether the code can be parallelized and not whether there is a way to create a parallel equivalent of the code )&lt;BR /&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;for ( i = 0; i &amp;lt; cnt; )
{
 x1 = 2.0 * x - 1.;
 if ( x1 &amp;lt; 1.0 )
 {
  i = i + 3;
  x = x*2.;
 }
 else // if ( x1 &amp;gt;= 1. )
 {
  i = i + 2;
  x = x/2.;
 }
}&lt;/PRE&gt;

&lt;P&gt;Thank you,&lt;/P&gt;
&lt;P&gt;David&lt;/P&gt;

&lt;P&gt;&lt;A href="https://community.intel.com/www.dalsoft.com" target="_blank"&gt;www.dalsoft.com&lt;/A&gt;&lt;/P&gt;
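&lt;P&gt;To make the distinction concrete, here is a minimal sketch (the function names are illustrative and do not come from any OpenMP document). For a loop with a fixed positive increment the trip count follows from a closed-form expression before the loop runs; for the loop above, the only general way shown here is to run the slice that updates i and x, which is itself a serial computation.&lt;/P&gt;
<![CDATA[
```cpp
#include <cstdint>

// Trip count of: for (i = first; i < last; i += step), with step > 0.
// It can be computed before the loop runs, which is the usual reading
// of "countable".
std::int64_t canonical_trip_count(std::int64_t first, std::int64_t last,
                                  std::int64_t step) {
    if (first >= last) return 0;
    return (last - first + step - 1) / step;
}

// Trip count of the loop in this post. The increment (3 or 2) depends on x,
// and x changes every iteration, so this sketch finds the count by executing
// the index/x slice of the loop.
std::int64_t sliced_trip_count(std::int64_t cnt, double x) {
    std::int64_t trips = 0;
    for (std::int64_t i = 0; i < cnt; ++trips) {
        double x1 = 2.0 * x - 1.0;
        if (x1 < 1.0) { i += 3; x *= 2.0; }
        else          { i += 2; x /= 2.0; }
    }
    return trips;
}
```
]]>
&lt;P&gt;With the same cnt, different starting values of x give different trip counts, which is why the count cannot be read off the loop header alone.&lt;/P&gt;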
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Nov 2018 10:29:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160784#M7914</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-18T10:29:05Z</dc:date>
    </item>
    <item>
      <title>You would need to make the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160785#M7915</link>
      <description>&lt;P&gt;You would need to make the parallel for run for the maximum required count. Then you could make the body of the loop conditional on i.&amp;nbsp; With static scheduling, this would imply work imbalance, so you could work with schedule(runtime) and try various choices by environment variable such as guided, auto, or dynamic.&amp;nbsp; With dynamic, at least, you should try various chunk sizes.&amp;nbsp; &amp;nbsp;Best choices will vary with number of cores, total number of iterations, and even which OpenMP library is in use.&lt;/P&gt;</description>
      <pubDate>Sun, 18 Nov 2018 12:45:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160785#M7915</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2018-11-18T12:45:57Z</dc:date>
    </item>
    <item>
      <title>"Countable" generally means</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160786#M7916</link>
      <description>&lt;P&gt;"Countable" generally means that the compiler can generate code that will compute the number of loop iterations without executing the loop.&lt;/P&gt;&lt;P&gt;Modifying the index variable outside of the increment expression in the "for" statement is often prohibited, though special cases can be countable (e.g., a simple unconditional increment of the index variable somewhere in the loop).&amp;nbsp;&lt;/P&gt;&lt;P&gt;In your case, the update(s) of the index variable are conditional, which is usually enough to prevent the loop from being countable.&amp;nbsp; To make it worse, the condition depends on a floating-point value, and that floating-point value is updated within the loop.&amp;nbsp;&amp;nbsp; The number of iterations in such a case may depend on the floating-point rounding mode in effect.&amp;nbsp; Determining the number of iterations in general code is equivalent to solving the Halting Problem, which is not possible. &lt;A href="https://en.wikipedia.org/wiki/Halting_problem" target="_blank"&gt;https://en.wikipedia.org/wiki/Halting_problem&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Nov 2018 19:28:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160786#M7916</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-11-18T19:28:45Z</dc:date>
    </item>
    <item>
      <title>"compiler can generate code</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160787#M7917</link>
      <description>&lt;P&gt;"compiler can generate code that will compute the number of loop iterations without executing the loop"&lt;/P&gt;&lt;P&gt;what kind of code? would it be acceptable to slice the code of the original loop to extract index generation and then loop that code ( and not the original loop )?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Nov 2018 19:57:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160787#M7917</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-18T19:57:49Z</dc:date>
    </item>
    <item>
      <title>That loop is not inherently</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160788#M7918</link>
      <description>&lt;P&gt;That loop is not inherently parallelizable (except in the case where the initial value of x is &amp;lt;=0.0, in which case replace the loop with x=x*(2**cnt)). Otherwise, x has loop order dependencies.&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 13:55:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160788#M7918</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-19T13:55:00Z</dc:date>
    </item>
    <item>
      <title>Edit: &lt;= 1.0 / (2**cnt)</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160789#M7919</link>
      <description>&lt;P&gt;Edit: &amp;lt;= 1.0 / (2**cnt)&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 16:13:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160789#M7919</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-19T16:13:18Z</dc:date>
    </item>
    <item>
      <title>Although that is not a point</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160790#M7920</link>
      <description>&lt;P&gt;Although that is not the point of my post, in the case you mentioned the result should perhaps be: x*(2**(cnt/3)).&lt;/P&gt;&lt;P&gt;Also, when you say that the loop is not parallelizable, I guess you mean "not parallelizable by OpenMP". I wrote an autoparallelizer ( see &lt;A href="https://community.intel.com/www.dalsoft.com" target="_blank"&gt;www.dalsoft.com&lt;/A&gt; ) that can parallelize this loop ( in fact, a much more complicated loop - the code in my post is a simplified version of that loop ).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 16:42:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160790#M7920</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-19T16:42:25Z</dc:date>
    </item>
    <item>
      <title>It would help if you give a</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160791#M7921</link>
      <description>&lt;P&gt;It would help if you give a link to the direct page that illustrates how the loop in #1 is auto-parallelized.&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 18:23:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160791#M7921</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-19T18:23:51Z</dc:date>
    </item>
    <item>
      <title>The loop that I refered to (</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160792#M7922</link>
      <description>&lt;P&gt;The loop that I referred to ( of which the code in my post is a simplified version ) is:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;for ( i = 0; i &amp;lt; cnt; )
{
  x1 = 2.0 * x - 1.;
  if ( x1 &amp;lt; 1.0 )
  {
    b[i] = exp( x1 ) * cos( x1 );
    i = i + 3;
    x = x*2.;
  }
  else  // if ( x1 &amp;gt;= 1. )
  {
    a[i] = sqrt( x1 ) * log( 1 / x1 );
    i = i + 2;
    x = x/2.;
  }
}&lt;/PRE&gt;
&lt;P&gt;I will present the results at the "Compiler, Architecture, And Tools Conference", see&lt;/P&gt;&lt;P&gt;&lt;A href="https://software.intel.com/en-us/event/compiler-conference/2018/schedule" target="_blank"&gt;https://software.intel.com/en-us/event/compiler-conference/2018/schedule&lt;/A&gt;&lt;/P&gt;&lt;P&gt;After the presentation ( December 17 ) I would be glad to show you the code.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 18:50:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160792#M7922</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-19T18:50:53Z</dc:date>
    </item>
    <item>
      <title>Please add this to your</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160793#M7923</link>
      <description>&lt;P&gt;Please add this to your calendar such that those here&amp;nbsp;not attending the conference can see and comment.&lt;/P&gt;&lt;P&gt;Even without a compiler that auto-parallelizes it, the #9 loop can be parallelized.&lt;/P&gt;&lt;P&gt;In the serial loop,&amp;nbsp;i always increments, and thus will not produce duplicate indices for [i]. While it has not been disclosed, it may be a requirement that the initial x be &amp;gt; 0.0. Therefore the values inserted into b or a would not be 0.0.&lt;/P&gt;&lt;P&gt;This untested code may be effective:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;atomic&amp;lt;double&amp;gt; fix_a[cnt], fix_b[cnt];
atomic&amp;lt;int&amp;gt; fill_a,fill_b;
...

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      b[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      fix_a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      fix_b[i] = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_a = -1;
fill_b = -1;
... x = some initial value

#pragma omp parallel
{
  // the master thread runs the serial index/x generation and stages x1
  #pragma omp master
  {
    for (int i = 0; i &amp;lt; cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 &amp;lt;  1.0 )
      {
        fix_b[i] = x1; // b[i] = exp( x1 ) * cos( x1 );
        fill_b = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 &amp;gt;= 1. )
      {
        fix_a[i] = x1; // a[i] = sqrt( x1 ) * log( 1 / x1 );
        fill_a = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i &amp;lt; cnt; )
    fill_a = cnt;
    fill_b = cnt;
  } // #pragma omp master
  // all threads here
  int empty_a = 0;
  int empty_b = 0;
  // until done
  for(;empty_a &amp;lt; cnt || empty_b &amp;lt; cnt;)
  {
    while(empty_a &amp;lt;= fill_a &amp;amp;&amp;amp; empty_a &amp;lt; cnt)
    {
      if(fix_a[empty_a] != 0.0)
      {
        // exchange lets exactly one thread claim the staged value
        double x1 = fix_a[empty_a].exchange(0.0);
        if(x1)
        {
          a[empty_a] = sqrt( x1 ) * log( 1 / x1 );
        } // if(x1)
      } // if(fix_a[empty_a])
      ++empty_a;
    }
    while(empty_b &amp;lt;= fill_b &amp;amp;&amp;amp; empty_b &amp;lt; cnt)
    {
      if(fix_b[empty_b] != 0.0)
      {
        double x1 = fix_b[empty_b].exchange(0.0);
        if(x1)
        {
          b[empty_b] = exp( x1 ) * cos( x1 );
        } // if(x1)
      } // if(fix_b[empty_b])
      ++empty_b;
    }
  } // for(;empty_a &amp;lt; cnt || empty_b &amp;lt; cnt;)
} // #pragma omp parallel&lt;/PRE&gt;

&lt;P&gt;Depending on your needs, you may want to insert _mm_pause() when waiting for work.&lt;/P&gt;
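&lt;P&gt;A minimal sketch of such a wait, assuming an x86 target for _mm_pause(). The helper names cpu_relax and wait_until_at_least and the HAVE_MM_PAUSE macro are illustrative only; on other targets this sketch falls back to a plain spin.&lt;/P&gt;
<![CDATA[
```cpp
#include <atomic>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define HAVE_MM_PAUSE 1
#endif

// Hint to the core that this is a spin-wait; a no-op where _mm_pause()
// is not available.
inline void cpu_relax() {
#ifdef HAVE_MM_PAUSE
    _mm_pause();
#endif
}

// Spin until the producer has published an index of at least `wanted`;
// returns the value that was observed.
inline int wait_until_at_least(const std::atomic<int>& fill, int wanted) {
    int seen;
    while ((seen = fill.load(std::memory_order_acquire)) < wanted)
        cpu_relax();
    return seen;
}
```
]]>
&lt;P&gt;A consumer would call wait_until_at_least(fill_a, empty_a) before examining fix_a[empty_a], instead of re-testing the condition in a tight loop.&lt;/P&gt;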
&lt;P&gt;Keep in mind you may need to modify the code.&lt;/P&gt;
&lt;P&gt;Also, the amount of work needs to be sufficient to amortize the overhead of starting/resuming the thread team. (IOW number of iterations is relatively large).&lt;/P&gt;
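&lt;P&gt;For anyone who wants to experiment, here is a compilable condensation of the idea under several assumptions: a and b are std::vector&amp;lt;double&amp;gt; of size cnt, a single fill counter serves both staging arrays, and no staged x1 is exactly 0.0 (0.0 marks an empty slot). The function name staged_loop and its signature are hypothetical.&lt;/P&gt;
<![CDATA[
```cpp
#include <atomic>
#include <cmath>
#include <vector>

// The master thread runs the serial i/x recurrence and stages x1; the
// expensive math is done by whichever thread claims a staged slot first.
void staged_loop(int cnt, double x, std::vector<double>& a, std::vector<double>& b) {
    std::vector<std::atomic<double>> fix_a(cnt), fix_b(cnt);
    for (int i = 0; i < cnt; ++i) { fix_a[i].store(0.0); fix_b[i].store(0.0); }
    std::atomic<int> fill(-1);  // highest index published so far; cnt when done

    #pragma omp parallel
    {
        #pragma omp master
        {
            for (int i = 0; i < cnt; ) {
                double x1 = 2.0 * x - 1.0;
                if (x1 < 1.0) { fix_b[i].store(x1); fill.store(i); i += 3; x *= 2.0; }
                else          { fix_a[i].store(x1); fill.store(i); i += 2; x /= 2.0; }
            }
            fill.store(cnt);
        }
        // All threads, including the master once it has finished producing.
        for (int next = 0; next < cnt; ) {
            while (next <= fill.load() && next < cnt) {
                // exchange() hands a nonzero value to exactly one thread.
                double xa = fix_a[next].exchange(0.0);
                if (xa != 0.0) a[next] = std::sqrt(xa) * std::log(1.0 / xa);
                double xb = fix_b[next].exchange(0.0);
                if (xb != 0.0) b[next] = std::exp(xb) * std::cos(xb);
                ++next;
            }
        }
    }
}
```
]]>
&lt;P&gt;Build with OpenMP enabled (for example -fopenmp with GCC) to get more than one thread; without it the pragmas are ignored and the two phases simply run one after the other.&lt;/P&gt;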
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 19 Nov 2018 21:03:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160793#M7923</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-19T21:03:00Z</dc:date>
    </item>
    <item>
      <title>It should be noted that the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160794#M7924</link>
      <description>&lt;P&gt;It should be noted that the wipe of fix_a and fix_b need only be done once due to the pickers resetting to 0.0 with exchange. As to what to do with a and b there is insufficient information in your postings.&lt;/P&gt;&lt;P&gt;You would want to assure that the threads performing the picking were on separate cores (IOW not with multiple threads within a core).&lt;/P&gt;&lt;P&gt;Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 15:22:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160794#M7924</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-20T15:22:49Z</dc:date>
    </item>
    <item>
      <title>As to what to do with a and b</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160795#M7925</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;As to what to do with a and b there is insufficient information in your postings.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;a and b are assumed to be parameters to the routine that contains the loop.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;No, I came up with this in an attempt to show the functionality of the auto parallelizer, specifically the ability to&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;calculate the loop count ( of a loop that is not countable, and thus not supported by OpenMP )&lt;/LI&gt;&lt;LI&gt;resolve memory dependencies for the memory writes to a and b&lt;/LI&gt;&lt;LI&gt;create code to calculate values needed at the entry to a thread: x and i&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If you would like, I will send you ( after the conference ) my presentation that explains the above.&lt;/P&gt;&lt;P&gt;As to the iteration counts: in the test suite I use, the iteration count cnt is set to be 100000000 ( 8 zeros ) - which makes me wonder how practical is your solution where you introduce new arrays of the size cnt.&lt;/P&gt;&lt;P&gt;Also note the use of transcendentals - this is done in order to give some weight to the loop; otherwise the overhead of doing the above would make auto-parallelization not worth it. I ran some tests adding more calls to "expensive" routines and saw how it improves the performance of the parallelized code. 
You may find the following article helpful to clarify that:&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.dalsoft.com/Calculating_number_of_cores_to_benefit_from_parallelization.pdf" target="_blank"&gt;http://www.dalsoft.com/Calculating_number_of_cores_to_benefit_from_parallelization.pdf&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 16:19:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160795#M7925</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-20T16:19:06Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;which makes me wonder how</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160796#M7926</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;which makes me wonder how practical is your solution where you introduce new arrays of the size cnt.&lt;/P&gt;&lt;P&gt;To reduce the additional arrays to 1 array, it is known that the X1's generated are all &amp;gt; 0.0. Therefore the sign could be used to indicate the path.&lt;/P&gt;&lt;P&gt;As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).&lt;/P&gt;&lt;P&gt;Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this. This said, one (you) could have the compiler identify this type of loop (something similar to a convergence loop), and produce a preamble for single path, then enter the flip/flop for the remainder of the convergence.&lt;/P&gt;&lt;P&gt;Simplified code of my prior post:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;atomic&amp;lt;double&amp;gt; fix_x[cnt];
atomic&amp;lt;int&amp;gt; fill_x;
...
// once only
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      b[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      fix_x[i] = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_x = -1;
... x = some initial value

#pragma omp parallel
{
  #pragma omp master
  {
    for (int i = 0; i &amp;lt; cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 &amp;lt;  1.0 )
      {
        fix_x[i] = -x1; // b[i] = exp( x1 ) * cos( x1 );
        fill_x = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 &amp;gt;= 1. )
      {
        fix_x[i] = x1; // a[i] = sqrt( x1 ) * log( 1 / x1 );
        fill_x = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i &amp;lt; cnt; )
    fill_x = cnt;
  } // #pragma omp master
  // all threads here
  int empty_x = 0;
  // until done
  for(;empty_x &amp;lt; cnt;)
  {
    // the extra bound keeps fix_x[cnt] from being read once fill_x == cnt
    while(empty_x &amp;lt;= fill_x &amp;amp;&amp;amp; empty_x &amp;lt; cnt)
    {
      if(fix_x[empty_x] != 0.0)
      {
        double x1 = fix_x[empty_x].exchange(0.0);
        if(x1)
        {
          if(x1 &amp;gt; 0.0)
            a[empty_x] = sqrt( x1 ) * log( 1 / x1 );
          else
            b[empty_x] = exp( -x1 ) * cos( -x1 );
        } // if(x1)
      } // if(fix_x[empty_x])
      ++empty_x;
    }
  } // for(;empty_x &amp;lt; cnt;)
} // #pragma omp parallel&lt;/PRE&gt;

&lt;P&gt;Perhaps you could compare the above with your auto-generated code.&lt;/P&gt;
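&lt;P&gt;The sign trick described at the top of this post can also be looked at on its own. A minimal sketch follows; the names Path, encode and decode are illustrative, and it relies on the same assumption as the code above, namely that every staged x1 is strictly positive, so that 0.0 can mean "empty slot" and a negative value can mean "path b".&lt;/P&gt;
<![CDATA[
```cpp
#include <cmath>

enum class Path { Empty, A, B };

// Producer side: tag the value with the branch that produced it.
// Requires x1 > 0.0.
inline double encode(double x1, bool took_b_path) {
    return took_b_path ? -x1 : x1;
}

// Consumer side: recover the branch and the original magnitude.
// x1 is left untouched when the slot is empty.
inline Path decode(double staged, double& x1) {
    if (staged == 0.0) return Path::Empty;
    x1 = std::fabs(staged);
    return staged > 0.0 ? Path::A : Path::B;
}
```
]]>
&lt;P&gt;Since x1 = 2x - 1, any x &amp;lt;= 0.5 yields x1 &amp;lt;= 0.0 on the b path; for such inputs the sign no longer identifies the branch and a separate tag would be needed.&lt;/P&gt;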
&lt;P&gt;Note, if the distribution of the modified cells is somewhat random, then this code may be better:&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;atomic&amp;lt;double&amp;gt; fix_x[cnt];
atomic&amp;lt;int&amp;gt; fill_x;
...
// once only
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      b[i] = 0.0;
    #pragma omp section
    for (int i = 0; i &amp;lt; cnt; ++i)
      fix_x[i] = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_x = -1;
... x = some initial value

#pragma omp parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  if(iThread == 0)
  {
    for (int i = 0; i &amp;lt; cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 &amp;lt;  1.0 )
      {
        fix_x[i] = -x1; // b[i] = exp( x1 ) * cos( x1 );
        fill_x = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 &amp;gt;= 1. )
      {
        fix_x[i] = x1; // a[i] = sqrt( x1 ) * log( 1 / x1 );
        fill_x = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i &amp;lt; cnt; )
    fill_x = cnt;
  } // if(iThread == 0)
  // with more than one thread, thread 0 only produces; the rest consume
  if(nThreads &amp;gt; 1)
  {
    --iThread;
    --nThreads;
  }
  if(iThread &amp;gt;=0)
  {
    int empty_x = 0;
    // until done
    for(;empty_x &amp;lt; cnt;)
    {
      // the extra bound keeps fix_x[cnt] from being read once fill_x == cnt
      while(empty_x &amp;lt;= fill_x &amp;amp;&amp;amp; empty_x &amp;lt; cnt)
      {
        if(empty_x % nThreads == iThread)
        {
          double x1 = fix_x[empty_x];
          if(x1 != 0.0)
          {
            fix_x[empty_x] = 0.0;
            if(x1 &amp;gt; 0.0)
              a[empty_x] = sqrt( x1 ) * log( 1 / x1 );
            else
              b[empty_x] = exp( -x1 ) * cos( -x1 );
          } // if(x1 != 0.0)
        } // if(empty_x % nThreads == iThread)
        ++empty_x;
      }
    } // for(;empty_x &amp;lt; cnt;)
  }
} // #pragma omp parallel&lt;/PRE&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 17:08:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160796#M7926</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-20T17:08:00Z</dc:date>
    </item>
    <item>
      <title>As to which is faster (your</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160797#M7927</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Your code wouldn't run on my machine because a and b, with cnt as I specified, take up all the memory. The rule for writing parallel code is that everything shall be done in place, with no huge memory allocations, since there may be no memory available.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;I fail to understand your preoccupation with the value of x. The autoparallelizer is not bothered by it at all. Of course the user should be careful to use values of x that don't cause exceptions, but the same exception will occur in both the sequential and the parallel code. Also, if it were clear how to parallelize sequential code, I would be out of business.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 17:46:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160797#M7927</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-20T17:46:50Z</dc:date>
    </item>
    <item>
      <title>Ok, then here is simplified</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160798#M7928</link>
      <description>&lt;P&gt;OK, then here is a simplified parallel loop in OpenMP:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#pragma omp parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  int interval = 0;
  // all threads perform the same loop control;
  // x and x1 must be private to each thread
  for (int i = 0; i &amp;lt; cnt; ++interval)
  {
    x1 = 2.0 * x - 1.; // * is relatively fast
    if ( x1 &amp;lt; 1.0 )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        b[i] = exp( x1 ) * cos( x1 );
      }
      i = i + 3;
      x = x*2.;
    }
    else  // if ( x1 &amp;gt;= 1. )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        a[i] = sqrt( x1 ) * log( 1 / x1 );
      }
      i = i + 2;
      x = x/2.;
    }
  }
}&lt;/PRE&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 20 Nov 2018 18:25:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160798#M7928</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-20T18:25:00Z</dc:date>
    </item>
    <item>
      <title>To the best of my knowledge,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160799#M7929</link>
      <description>&lt;P&gt;To the best of my knowledge, the loop in your last post is not canonical and therefore wouldn't be accepted by OpenMP; gcc should give a compilation error (when using -fopenmp) requesting an explicit loop increment.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Nov 2018 06:03:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160799#M7929</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-21T06:03:38Z</dc:date>
    </item>
    <item>
      <title>Please disregard the last</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160800#M7930</link>
      <description>&lt;P&gt;Please disregard the last post - I missed the fact that "#pragma omp" applied to a block ( and not to a loop ).&lt;/P&gt;&lt;P&gt;Sorry.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Nov 2018 19:58:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160800#M7930</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-21T19:58:16Z</dc:date>
    </item>
    <item>
      <title>Here is the code that works,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160801#M7931</link>
      <description>&lt;P&gt;Here is code that works; however, the compute time of the "DoWork" emulation is rather short.&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;// GoofyLoop.cpp
//

#include "stdafx.h"
#include &amp;lt;iostream&amp;gt;

#include &amp;lt;immintrin.h&amp;gt;
#include &amp;lt;math.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;

const __int64 N = 100000000;
double* a;
double* b;
const double typicalX = 3.141592653589793;

void Serial(void)
{
	double x = typicalX;
	double x1;
	__int64 cnt = N;
	for (__int64 i = 0; i &amp;lt; cnt;)
	{
		x1 = 2.0 * x - 1.;
		if (x1 &amp;lt;  1.0)
		{
			b&lt;I&gt; = exp(x1) * cos(x1);
			i = i + 3;
			x = x*2.;
		}
		else  // if ( x1 &amp;gt;= 1. )
		{
			a&lt;I&gt; = sqrt(x1) * log(1 / x1);
			i = i + 2;
			x = x / 2.;
		}
	}
}

void Parallel(void)
{
#pragma omp parallel
	{
		int iThread = omp_get_thread_num();
		int nThreads = omp_get_num_threads();
		double x = typicalX;
		double x1;
		__int64 cnt = N;
		__int64 interval = 0;
		for (__int64 i = 0; i &amp;lt; cnt; ++interval)
		{
			x1 = 2.0 * x - 1.;
			if (x1 &amp;lt; 1.0)
			{
				if (interval%nThreads == iThread)
				{
					b&lt;I&gt; = exp(x1) * cos(x1);
				}
				i = i + 3;
				x = x*2.;
			}
			else  // if ( x1 &amp;gt;= 1. )
			{
				if (interval%nThreads == iThread)
				{
					a&lt;I&gt; = sqrt(x1) * log(1 / x1);
				}
				i = i + 2;
				x = x / 2.;
			}
		}
	}
}
int _tmain(int argc, _TCHAR* argv[])
{
	a = (double*)malloc(N * sizeof(double)); // new double&lt;N&gt;;
	b = (double*)malloc(N * sizeof(double)); // new double&lt;N&gt;;
#pragma omp parallel
	{
#pragma omp master
		{
			std::cout &amp;lt;&amp;lt; "nThreads = " &amp;lt;&amp;lt; omp_get_num_threads() &amp;lt;&amp;lt; std::endl;
		}
	}
#pragma omp parallel for
	for (int i = 0; i &amp;lt; N; ++i)
	{
		a&lt;I&gt; = 0.0;
		b&lt;I&gt; = 0.0;
	}

	for (int rep = 0; rep &amp;lt; 3; ++rep)
	{
		unsigned __int64 t0 = _rdtsc();
		Serial();
		unsigned __int64 t1 = _rdtsc();
		std::cout &amp;lt;&amp;lt; "Serial ticks = " &amp;lt;&amp;lt; t1 - t0 &amp;lt;&amp;lt; std::endl;
	}
	std::cout &amp;lt;&amp;lt; std::endl;
	for (int rep = 0; rep &amp;lt; 3; ++rep)
	{
		unsigned __int64 t0 = _rdtsc();
		Parallel();
		unsigned __int64 t1 = _rdtsc();
		std::cout &amp;lt;&amp;lt; "Parallel ticks = " &amp;lt;&amp;lt; t1 - t0 &amp;lt;&amp;lt; std::endl;
	}
	return 0;
}&lt;/I&gt;&lt;/I&gt;&lt;/N&gt;&lt;/N&gt;&lt;/I&gt;&lt;/I&gt;&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;On a 4 core w/ HT, Core i7 2700K, running 1 thread per core:&lt;/P&gt;

&lt;PRE class="brush:plain; class-name:dark;"&gt;Threads = 4
Serial ticks = 3076054566
Serial ticks = 2543030436
Serial ticks = 2547671985

Parallel ticks = 2116263348
Parallel ticks = 2116889788
Parallel ticks = 2128250491&lt;/PRE&gt;

&lt;P&gt;Marginal.&lt;/P&gt;
&lt;P&gt;3 Threads:&lt;/P&gt;

&lt;PRE class="brush:plain; class-name:dark;"&gt;hreads = 3
Serial ticks = 2585388714
Serial ticks = 2603521131
Serial ticks = 2602569400

Parallel ticks = 2098115233
Parallel ticks = 2108614224
Parallel ticks = 2098695600&lt;/PRE&gt;

&lt;P&gt;Slightly better.&lt;/P&gt;
&lt;P&gt;The "DoWork" section is a relatively small computation between memory writes, so for this example the degree of (productive) parallelization appears to depend on the memory subsystem.&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 22 Nov 2018 03:24:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160801#M7931</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-22T03:24:24Z</dc:date>
    </item>
    <item>
      <title>I created the following test</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160802#M7932</link>
      <description>&lt;P&gt;I created the following test program:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;math.h&amp;gt;

#define N 100000000

double a[N], b[N];

double foo( double *a, double *b, double x, unsigned int cnt )
 {
  double x1;
  unsigned int i;

asm( "#.dco_start" );

  for ( i = 0; i &amp;lt; cnt; i++ )
   {
    x1 = 2.0 * x - 1;
    if ( x1 &amp;lt;  1.0 )
     {
      x1 = exp( x1 ) * cos( x1 );
      b[i] = x1;
      i = i + 3;
      x = x*2.;
     }
    else  // if ( x1 &amp;gt;= 1. )
     {
      x1 = sqrt( x1 ) * log( 1. / x1 );
      a[i] = x1;
      i = i + 2;
      x = x/2.;
     }
   }

asm( "#.dco_end" );

  return x;

 }

// by Jim Dempsey
double foo_Jim( double *a, double *b, double x, unsigned int cnt )
 {
  double x1;
  unsigned int i;

#if 0

#pragma omp parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  int interval = 0;
  // all threads perform
   for ( i = 0; i &amp;lt; cnt; ++interval)
   {
    x1 = 2.0 * x - 1.; // * is relatively fast
    if ( x1 &amp;lt;  1.0 )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        b[i] = exp( x1 ) * cos( x1 ); //
      }
      i = i + 3;
      x = x*2.;
    }
    else  // if ( x1 &amp;gt;= 1. )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        a[i] = sqrt( x1 ) * log( 1 / x1 );
      }
      i = i + 2;
      x = x/2.;
     }
   }
}

#endif

return x;

}


int main()
 {
  double rslt, rslt1;
  unsigned int i;

  for ( i = 0; i &amp;lt; N; i++ )
   {
    a[i] = 0.;
    b[i] = 0.;
   }

  for( i = 0; i &amp;lt; 10; i++ )
   {
    rslt = foo( a, b, 3.1415, N - 100 );
//    rslt = foo_Jim( a, b, 3.1415, N - 100 );
   }

  rslt1 = 0.;
  for ( i = 0; i &amp;lt; N; i++ )
   {
    rslt1 += a[i];
    rslt1 += b[i];
   }

  printf( "rslt  %f   %f\n", rslt, rslt1 );

 }&lt;/PRE&gt;
&lt;P&gt;and generated 3 executables (altering the code as necessary):&lt;/P&gt;&lt;P&gt;serial code: loop&lt;/P&gt;&lt;P&gt;parallel code generated by my auto-parallelizer: loop_dco&lt;/P&gt;&lt;P&gt;your code: loop_Jim&lt;/P&gt;&lt;P&gt;Each program was executed 3 times on an E8600 @ 3.33GHz, 2-core Linux machine. Execution was under the "time" command (e.g. "time ./loop"), and the reported time (see below) is the 'real' time produced by the "time" command that is neither the fastest nor the slowest of the 3 executions attempted.&lt;/P&gt;&lt;P&gt;The execution times (in seconds) are:&lt;/P&gt;&lt;P&gt;loop: 14.99&lt;/P&gt;&lt;P&gt;loop_dco: 11.75&lt;/P&gt;&lt;P&gt;loop_Jim: 10.48&lt;/P&gt;&lt;P&gt;As you can see, the program generates and prints checksums. For loop and loop_dco these checksums always (for every invocation) fully agreed; loop_Jim was generating different checksums for every separate run (?).&lt;/P&gt;&lt;P&gt;A few words about the code:&lt;/P&gt;&lt;P&gt;loop_Jim assumes that a and b are non-overlapping memory regions and therefore may be used in parallel code; loop_dco doesn't make such an assumption and generates code to verify that dynamically at run time, at an overhead of up to 20%.&lt;/P&gt;&lt;P&gt;loop_Jim doesn't preserve the value of x that shall be returned.&lt;/P&gt;&lt;P&gt;If you like, I would be glad to send you my (Linux) executables.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Nov 2018 13:01:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160802#M7932</guid>
      <dc:creator>Livshin__David</dc:creator>
      <dc:date>2018-11-22T13:01:39Z</dc:date>
    </item>
    <item>
      <title>Knowing this is now memory</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160803#M7933</link>
      <description>&lt;P&gt;Knowing that this is now memory-access bound, performing aligned allocation and organizing the stores by cache line yields a little more improvement:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;const int CacheLineSize = 64;
const int doublesInCacheLine = CacheLineSize / sizeof(double);
...
	a = (double*)_mm_malloc(N * sizeof(double), CacheLineSize); // (double*)malloc(N * sizeof(double)); // new double[N];
	b = (double*)_mm_malloc(N * sizeof(double), CacheLineSize); // (double*)malloc(N * sizeof(double)); // new double[N];
...
				if ((i / doublesInCacheLine) % nThreads == iThread)
...
				if ((i / doublesInCacheLine) % nThreads == iThread)&lt;/PRE&gt;

&lt;PRE class="brush:plain; class-name:dark;"&gt;nThreads = 4
Serial ticks = 2665960090
Serial ticks = 2645705503
Serial ticks = 2551374815

Parallel ticks = 1949592124
Parallel ticks = 1957116967
Parallel ticks = 2024346334&lt;/PRE&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 22 Nov 2018 15:12:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Countable-loops-in-openMP/m-p/1160803#M7933</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-11-22T15:12:50Z</dc:date>
    </item>
  </channel>
</rss>

