<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic OpenMP guidance in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951024#M5165</link>
    <description>&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Hi,&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;I am developing a neural network package using Intel C++ compiler 9.0. The code is so parallel that it is a no brainer to use OpenMP. The problem is to know when not to use it.&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Most of my code is vector operations (dot product, vector add, scaling etc.). What I am looking for is some guidance as to when it becomes detrimental to parallelize - for example, it is probably worth parallelizing A.B if dimensionality of vectors A &amp;amp; B is 10^6. But should I parallelize such loops when I expect the typical dimensionality to be 100 or 1000 or 10000?&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;I would appreciate it if anyone can provide guidance on this.&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Btw, I noticed (after spending 2 days tearing my hair out :) that Intel optimizer (/O3 /Qip) does not perform scalar replacement in loops nested inside parallelized loops - it had really slowed my application down.&lt;/DIV&gt;
&lt;DIV&gt;Also, is it to be expected that parallelized loops are not vectorized? I would have thought it should be possible with static scheduling at least.&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Many thanks&lt;/DIV&gt;
&lt;DIV&gt;P&lt;/DIV&gt;</description>
    <pubDate>Tue, 17 Jan 2006 21:00:17 GMT</pubDate>
    <dc:creator>perceptron</dc:creator>
    <dc:date>2006-01-17T21:00:17Z</dc:date>
    <item>
      <title>OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951024#M5165</link>
      <description>&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Hi,&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;I am developing a neural network package using Intel C++ compiler 9.0. The code is so parallel that it is a no brainer to use OpenMP. The problem is to know when not to use it.&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Most of my code is vector operations (dot product, vector add, scaling etc.). What I am looking for is some guidance as to when it becomes detrimental to parallelize - for example, it is probably worth parallelizing A.B if dimensionality of vectors A &amp;amp; B is 10^6. But should I parallelize such loops when I expect the typical dimensionality to be 100 or 1000 or 10000?&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;I would appreciate it if anyone can provide guidance on this.&lt;/DIV&gt;
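&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;To make the question concrete, here is a minimal sketch of the kind of loop I mean, assuming OpenMP's if clause is used to keep short vectors on a single thread. The names dot and kParallelCutoff and the value 10000 are placeholders for illustration only; the right cutoff is exactly what I am asking about.&lt;/DIV&gt;
&lt;PRE&gt;<![CDATA[
```cpp
#include <cassert>

// Hypothetical cutoff: below this length the loop stays on one thread.
// The value is a placeholder to be tuned by measurement, not a recommendation.
const int kParallelCutoff = 10000;

// Dot product that only uses a thread team when n reaches the cutoff.
// When the if clause evaluates to false, the region runs on a single thread.
float dot(const float* a, const float* b, int n)
{
    assert(n >= 0);
    float sum = 0.0f;
    // A signed int index keeps the loop valid for older OpenMP versions.
    #pragma omp parallel for reduction(+:sum) if(n >= kParallelCutoff)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```
]]>&lt;/PRE&gt;
&lt;DIV&gt;With this sketch, dot(a, b, 1000) would run on one thread, while dot(a, b, 1000000) would be split across the team. Without OpenMP enabled the pragma is ignored and the loop is plain serial code.&lt;/DIV&gt;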
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Btw, I noticed (after spending 2 days tearing my hair out :) that Intel optimizer (/O3 /Qip) does not perform scalar replacement in loops nested inside parallelized loops - it had really slowed my application down.&lt;/DIV&gt;
&lt;DIV&gt;Also, is it to be expected that parallelized loops are not vectorized? I would have thought it should be possible with static scheduling at least.&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Many thanks&lt;/DIV&gt;
&lt;DIV&gt;P&lt;/DIV&gt;</description>
      <pubDate>Tue, 17 Jan 2006 21:00:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951024#M5165</guid>
      <dc:creator>perceptron</dc:creator>
      <dc:date>2006-01-17T21:00:17Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951025#M5166</link>
      <description>I spent half an hour trying to answer part of your questions, then the site discarded my answer, which was no doubt longer than you wanted to read.&lt;BR /&gt;I expect 9.1 to improve on some of the issues you mention, particularly for 64-bit mode.  It does make a difference which platforms you are interested in.&lt;BR /&gt;Did you try -O1, to reduce aggressiveness of unrolling in vectorized loops?  Vectorized dot products are batched into 8 parallel sums at -O2.&lt;BR /&gt;Can you ensure that your vectors are 16-byte aligned, and that the compiler knows this?</description>
      <pubDate>Tue, 17 Jan 2006 23:49:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951025#M5166</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2006-01-17T23:49:15Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951026#M5167</link>
      <description>I suppose OpenMP parallelization is most efficient when the data sets allocated to the threads approach a full page size apart (4KB for Xeon).  Parallel vectorized code is likely to use the full capacity of the memory system, and you would like to avoid duplicate DTLB and cache filling between the threads.</description>
      <pubDate>Wed, 18 Jan 2006 01:00:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951026#M5167</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2006-01-18T01:00:16Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951027#M5168</link>
      <description>Hi Tim,&lt;BR /&gt;&lt;BR /&gt;Thanks for the reply. I am using the IA32 compiler on XP. The largest vectors that I have tested my system with are about 50K long, and are single-precision floats. My development machine is a dual-core Pentium D.&lt;BR /&gt;I take care to keep things aligned (I am assuming that the standard C++ new operator dishes out aligned arrays). I use option -Zp16 and take care while accessing memory allocated on the heap, so everything should be aligned to 16-byte boundaries. I don't tell the compiler about it though, because the code would become very messy if I started putting __decl... debris everywhere. I didn't find a compiler option to assume everything is aligned.&lt;BR /&gt;I was parallelizing almost all the loops with OpenMP, but after your advice, I only parallelize the loops that suck in data (i.e. the outermost ones). These typically have an iteration count of approx. 100K.&lt;BR /&gt;Most of the stuff works at a decent speed, now that I take care to replace scalars in the inner loops myself. I was just curious about when to prefer vectorization over OpenMP.&lt;BR /&gt;&lt;BR /&gt;One thing that I would like clarified is why the vectorizer does not like this: for(i = 0;...) x[i] = tanhf(y[i])&lt;BR /&gt;where tanhf() is coming from the Intel math library. I remember reading somewhere that Intel have vector forms of these simple math functions. What do I need to do to access them?&lt;BR /&gt;&lt;BR /&gt;Discovered another OpenMP bug today: it doesn't compile with a shared clause on orphaned 'for' directives, though there is example code in the help that does this.&lt;BR /&gt;&lt;BR /&gt;Thanks</description>
      <pubDate>Wed, 18 Jan 2006 05:08:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951027#M5168</guid>
      <dc:creator>perceptron</dc:creator>
      <dc:date>2006-01-18T05:08:02Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951028#M5169</link>
      <description>You can check the contents of libsvml yourself to see which functions are there.  Apparently, it's more difficult to write vector versions of sinh() and tanh() than it is for the inverse functions.  Among the more evident difficulties is the need to use a method to preserve accuracy for small arguments.  If you don't care about that, you can use macro replacement:&lt;BR /&gt;#define tanhf(x) (1 - 2/(expf((x)*2) + 1))&lt;BR /&gt;to see if it's worth anything in performance.&lt;BR /&gt;&lt;BR /&gt;I agree about the dilemma posed by the performance advantage of alignment directives, and the fact that the gcc version of them isn't accepted by ICL.  We persuaded the compiler team that Fortran should align all potentially vectorizable arrays by default, rather than supporting such directives.&lt;BR /&gt;I think it's unfortunate that 32-bit tradition apparently dictates that standard new() and malloc() don't guarantee alignment.  Thus the provision of _aligned_malloc() and the like.&lt;BR /&gt;If your loops have few enough operands, and are long enough, the overhead for taking care of various cases of alignment is tolerable.</description>
      <pubDate>Wed, 18 Jan 2006 05:55:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951028#M5169</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2006-01-18T05:55:38Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951029#M5170</link>
      <description>&lt;DIV&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;Dear Perceptron,&lt;BR /&gt;&lt;/FONT&gt;&lt;FONT face="Times New Roman" size="2"&gt;Could you please elaborate on the problems you encountered with vectorizing the tanh() function? I see straightforward vectorization of this code (for 7.x 8.x and 9.x compilers).&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;float x[100], y[100];&lt;BR /&gt;.&lt;BR /&gt;for (i = 0; i &amp;lt; 100; i++)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;x[i] = tanhf(y[i]);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;joho.c(7) : (col. 3) remark: LOOP WAS VECTORIZED.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;Aart Bik&lt;BR /&gt;&lt;/FONT&gt;&lt;A href="http://www.aartbik.com/" target="_blank"&gt;&lt;FONT face="Times New Roman" size="2"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.aartbik.com/" target="_blank"&gt;http://www.aartbik.com/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jan 2006 09:10:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951029#M5170</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2006-01-25T09:10:36Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951030#M5171</link>
      <description>libsvml does include the function vmldTanh4, so the compiler vectorizer should invoke that automatically in appropriate situations.</description>
      <pubDate>Wed, 25 Jan 2006 23:31:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951030#M5171</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2006-01-25T23:31:13Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951031#M5172</link>
      <description>&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;Tim,&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;No such function! The appropriate function is called _vmlsTanh4() with an "s" for single-precision. The function _vmldTanh2() would be used to vectorize a double-precision version of the loop. In any case, I thought I had already made it clear in my previous message that this loop should vectorize, so I would like to know what problems the customer encountered.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;Aart Bik&lt;BR /&gt;&lt;/FONT&gt;&lt;A href="http://www.aartbik.com/" target="_blank"&gt;&lt;FONT face="Times New Roman" size="2"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.aartbik.com/" target="_blank"&gt;http://www.aartbik.com/&lt;/A&gt;&lt;/P&gt;
&lt;DIV&gt;&lt;/DIV&gt;&lt;P&gt;Message Edited by abik on &lt;SPAN class="date_text"&gt;01-25-2006&lt;/SPAN&gt; &lt;SPAN class="time_text"&gt;09:22 AM&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jan 2006 01:20:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951031#M5172</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2006-01-26T01:20:32Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951032#M5173</link>
      <description>Hi tim &amp;amp; abik,&lt;BR /&gt;&lt;BR /&gt;Sorry for the delay - I had not seen the board for a while. Here is the compiler report (/O3 /Qip /Qvec-report3) on IA32:&lt;BR /&gt;&lt;BR /&gt;ActivationFunctions.inl(61) : (col. 3) remark: loop was not vectorized: contains unvectorizable statement at line 62&lt;BR /&gt;&lt;BR /&gt;I get this error for all of the loops where the code (containing tanh, exp, 1/(1 + exp())) is in either of these 2 forms:&lt;BR /&gt;&lt;BR /&gt;//in-place&lt;BR /&gt;void ApplyForward(size_t I, real *z) const&lt;BR /&gt;{&lt;BR /&gt;for (size_t i = 0; i &amp;lt; I; ++i)&lt;BR /&gt;z[i] = tanh(z[i]);&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;// propagators&lt;BR /&gt;void ApplyForward(size_t I, const real *a, real *z) const&lt;BR /&gt;{&lt;BR /&gt;for (size_t i = 0; i &amp;lt; I; ++i)&lt;BR /&gt;z[i] = tanh(a[i]);&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;real is a typedef for either float or double and, depending on what real is, tanh is #defined to be tanhf or tanh (see below).
I have only worked with single precision stuff so far (USE_DOUBLE not defined).&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;P&lt;BR /&gt;&lt;BR /&gt;#ifdef __INTEL_COMPILER&lt;BR /&gt;#include &amp;lt;mathimf.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;#ifndef USE_DOUBLE&lt;BR /&gt;#define tanh(x) tanhf(x)&lt;BR /&gt;#define sinh(x) sinhf(x)&lt;BR /&gt;#define cosh(x) coshf(x)&lt;BR /&gt;&lt;BR /&gt;#define log(x) logf(x)&lt;BR /&gt;#define exp(x) expf(x)&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;#endif&lt;BR /&gt;#else // __INTEL_COMPILER&lt;BR /&gt;&lt;BR /&gt;#include &amp;lt;math.h&amp;gt;&lt;BR /&gt;#define isnan(x) _isnan(x)&lt;BR /&gt;&lt;BR /&gt;#endif</description>
      <pubDate>Thu, 26 Jan 2006 22:33:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951032#M5173</guid>
      <dc:creator>perceptron</dc:creator>
      <dc:date>2006-01-26T22:33:08Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951033#M5174</link>
      <description>As a general rule, you should check the pre-processed code yourself, if you don't want to show it in a readable form.  The general rule for bug reports (for gcc as well as icc) is to show pre-processed source code.  mathimf.h may be intended to prevent vectorization; you will notice that Aart didn't put it in his example.  &lt;BR /&gt;In your case&lt;BR /&gt;z[i] = tanhf(a[i]);&lt;BR /&gt;you may also need to declare the arguments with restrict:&lt;BR /&gt;float *restrict z, float *restrict a&lt;BR /&gt;(with icpc option -restrict, which allows C99 restrict compatibility)&lt;BR /&gt;&lt;BR /&gt;That's another can of worms, as to how various compilers use const and restrict to facilitate optimizations such as vectorization.</description>
      <pubDate>Fri, 27 Jan 2006 00:59:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951033#M5174</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2006-01-27T00:59:33Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951034#M5175</link>
      <description>&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;Dear Perceptron,&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;Your code was very hard to read and was missing a lot of essential parts, but I suspect you have something like this:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman"&gt;&lt;FONT size="2"&gt;&lt;C&gt; cat joho.cpp&lt;/C&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;#include &amp;lt;stddef.h&amp;gt;&lt;BR /&gt;#include &amp;lt;mathimf.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;#ifndef USE_DOUBLE&lt;BR /&gt;typedef float real;&lt;BR /&gt;#define tanh(x) tanhf(x)&lt;BR /&gt;#endif&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;//in-place&lt;BR /&gt;void ApplyForward(size_t I, real *z){&lt;BR /&gt;&amp;nbsp;&amp;nbsp;for (size_t i = 0; i &amp;lt; I; ++i)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;z[i] = tanh(z[i]);&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;// propagators&lt;BR /&gt;void ApplyForward(size_t I, const real *a, real *z) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;for (size_t i = 0; i &amp;lt; I; ++i)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;z[i] = tanh(a[i]);&lt;BR /&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;But even then, I see no problems with vectorization whatsoever (going back all the way to version 7.1 of our compilers):&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;&lt;C&gt; icl -QxP joho.cpp&lt;BR /&gt;&lt;/C&gt;&lt;/FONT&gt;&lt;FONT face="Times New Roman"&gt;&lt;FONT size="2"&gt;&lt;BR /&gt;joho.cpp(12) : (col. 3) remark: LOOP WAS VECTORIZED.&lt;BR /&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT face="Times New Roman" size="2"&gt;joho.cpp(18) : (col. 3) remark: LOOP WAS VECTORIZED.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;A few comments:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;(1) Tim is right, even though const restricts certain forms of assignments, it has no impact on data dependence analysis (I just wrote a few paragraphs on that in the upcoming second edition of the Software Optimization Cookbook, see &lt;/FONT&gt;&lt;A href="http://www.intel.com/intelpress/sum_swcb2.htm" target="_blank"&gt;&lt;FONT face="Times New Roman" size="2"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.intel.com/intelpress/sum_swcb2.htm" target="_blank"&gt;http://www.intel.com/intelpress/sum_swcb2.htm&lt;/A&gt;&lt;FONT face="Times New Roman" size="2"&gt;). Both loops vectorize by default, but you can&lt;SPAN&gt; &lt;/SPAN&gt;avoid a runtime overlap test between a and z for the second using restrict or #pragma ivdep before the loop, making the second function slightly more efficient.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;(2) Since the alignment of the data pointed to is not known in this context, adding a __assume_aligned() or #pragma vector aligned before both loops avoids runtime peeling for alignment, making both functions slightly more efficient.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;(3) Obviously, we still don't know why your code does not vectorize. Feel free to email me the source directly (&lt;/FONT&gt;&lt;A href="mailto:aart.bik@intel.com" target="_blank"&gt;&lt;FONT face="Times New Roman" size="2"&gt;aart.bik@intel.com&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face="Times New Roman" size="2"&gt;) if you want me to investigate this further. You may also want to read the online vectorization guidelines at &lt;/FONT&gt;&lt;A href="http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm" target="_blank"&gt;&lt;FONT face="Times New Roman" size="2"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm" target="_blank"&gt;http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm&lt;/A&gt;&lt;FONT face="Times New Roman" size="2"&gt; or a more detailed description in the Software Vectorization Handbook at &lt;/FONT&gt;&lt;A href="http://www.intel.com/intelpress/sum_vmmx.htm" target="_blank"&gt;&lt;FONT face="Times New Roman" size="2"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.intel.com/intelpress/sum_vmmx.htm" target="_blank"&gt;http://www.intel.com/intelpress/sum_vmmx.htm&lt;/A&gt;&lt;FONT face="Times New Roman" size="2"&gt;.&lt;/FONT&gt;&lt;/P&gt;
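&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;Comments (1) and (2) can be sketched as follows, assuming the caller guarantees that a and z do not overlap and that both are 16-byte aligned. The name ApplyForwardHinted is hypothetical, __restrict is the spelling of restrict that most C++ compilers accept without an extra option, and the Intel-specific pragma is guarded so that other compilers simply skip it. Treat this as an illustration under those assumptions, not a drop-in fix:&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;<![CDATA[
```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

typedef float real;

// Hypothetical variant of the second ApplyForward. The caller promises that
// a and z do not overlap and that both point to 16-byte aligned storage.
void ApplyForwardHinted(std::size_t n, const real* __restrict a, real* __restrict z)
{
    // Check the alignment promise in debug builds (assumed requirement: 16 bytes).
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(z) % 16 == 0);

#ifdef __INTEL_COMPILER
    // Intel-specific hint from comment (2): no runtime peeling for alignment.
    #pragma vector aligned
#endif
    for (std::size_t i = 0; i < n; ++i)
        z[i] = std::tanh(a[i]);  // float overload of tanh from <cmath>
}
```
]]>&lt;/PRE&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;The restrict qualifiers remove the need for the runtime overlap test, and the alignment hint is only safe if every caller really passes aligned buffers, which is why the sketch asserts it.&lt;/FONT&gt;&lt;/P&gt;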
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;Hope this helps.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;Aart Bik&lt;BR /&gt;&lt;/FONT&gt;&lt;A href="http://www.aartbik.com/" target="_blank"&gt;&lt;FONT face="Times New Roman" size="2"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.aartbik.com/" target="_blank"&gt;http://www.aartbik.com/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 27 Jan 2006 02:06:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951034#M5175</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2006-01-27T02:06:01Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951035#M5176</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;Sorry about the formatting - the website screwed up my copy-pasted code.&lt;BR /&gt;I am including mathimf because the documentation states "To use the Intel math library, include the header file, mathimf.h, in your program".&lt;BR /&gt;&lt;BR /&gt;I don't think that using restrict will help, since the vectorizer reports that the statement containing the tanhf() call is not vectorizable. Also, it would pollute the code with arcana.&lt;BR /&gt;&lt;BR /&gt;Maybe this can shed some light: I am linking my executable with a 3rd party static library, which has been compiled with the Intel compiler, but I don't know if they used mathimf. Do you think this may confuse the compiler? It is a possibility, if the vectorization happens at link time, as I have ensured that the headers that I include don't make any reference to standard headers.&lt;BR /&gt;&lt;BR /&gt;Thanks</description>
      <pubDate>Fri, 27 Jan 2006 02:16:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951035#M5176</guid>
      <dc:creator>perceptron</dc:creator>
      <dc:date>2006-01-27T02:16:16Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP guidance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951036#M5177</link>
      <description>Automatic vectorization occurs only at compile time.  You could examine a library to determine whether it contains those svml math function calls which come from auto-vectorization.</description>
      <pubDate>Fri, 27 Jan 2006 02:50:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-guidance/m-p/951036#M5177</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2006-01-27T02:50:28Z</dc:date>
    </item>
  </channel>
</rss>

