<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic I'd suggest "#pragma vector in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042538#M4601</link>
    <description>&lt;P&gt;I'd suggest "#pragma vector aligned" right ahead of the inner for(). If you switch to AVX, it would mean 32-byte aligned. Another good alternative with Intel compilers is the __aligned designator, but other compilers will complain.&lt;/P&gt;

&lt;P&gt;Since the compiler has already generated code for peeling and for checking alignment, those pragmas would only simplify the code (and maybe the compiler messages), giving you a slight advantage in starting the loop.&lt;/P&gt;</description>
    <pubDate>Mon, 08 Sep 2014 15:27:00 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2014-09-08T15:27:00Z</dc:date>
    <item>
      <title>Loop vectorization and how to read optimization report</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042534#M4597</link>
      <description>&lt;P&gt;Hello, I have this small sample code:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;double foo(double **cache, double *prod, int iQ, int l)
{
	double FF = 0;
	for (int iP = 0; iP &amp;lt; l; ++iP) {
		const double * p = cache[iP];
		register double prod1 = prod[iP];
		for (int iP2 = 0; iP2 &amp;lt; l; ++iP2) {
			FF += prod[iP2] * p[iP2] * prod1;
		}
	}
	return FF;
}&lt;/PRE&gt;

&lt;P&gt;Compiler options are: -O3 -std=c99 -fstrict-aliasing -xSSE4.2 -align -qopt-report=5&lt;/P&gt;

&lt;P&gt;Optimization report is:&lt;/P&gt;

&lt;P&gt;Begin optimization report for: foo(double **, double *, int, int)&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; Report from: Interprocedural optimizations [ipo]&lt;/P&gt;

&lt;P&gt;INLINE REPORT: (foo(double **, double *, int, int)) [1/1=100.0%] x.c(4,1)&lt;/P&gt;

&lt;P&gt;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; Report from: Loop nest, Vector &amp;amp; Auto-parallelization optimizations [loop, vec, par]&lt;/P&gt;

&lt;P&gt;&lt;BR /&gt;
	LOOP BEGIN at x.c(6,2)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;remark #25096: Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;remark #25452: Original Order found to be proper, but by a close margin&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;remark #25461: Imperfect Loop Unroll-Jammed by 2 &amp;nbsp; (pre-vector)&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;remark #15344: loop was not vectorized: vector dependence prevents vectorization&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;remark #15346: vector dependence: assumed FLOW dependence between FF line 10 and FF line 10&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;remark #15346: vector dependence: assumed ANTI dependence between FF line 10 and FF line 10&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;remark #15346: vector dependence: assumed ANTI dependence between FF line 10 and FF line 10&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;remark #15346: vector dependence: assumed FLOW dependence between FF line 10 and FF line 10&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;LOOP BEGIN at x.c(9,3)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15344: loop was not vectorized: vector dependence prevents vectorization&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15346: vector dependence: assumed FLOW dependence between FF line 10 and FF line 10&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15346: vector dependence: assumed ANTI dependence between FF line 10 and FF line 10&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15346: vector dependence: assumed ANTI dependence between FF line 10 and FF line 10&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15346: vector dependence: assumed FLOW dependence between FF line 10 and FF line 10&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #25439: unrolled with remainder by 2 &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;LOOP END&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;LOOP BEGIN at x.c(9,3)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;&amp;lt;Remainder&amp;gt;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;LOOP END&lt;BR /&gt;
	LOOP END&lt;/P&gt;

&lt;P&gt;LOOP BEGIN at x.c(6,2)&lt;BR /&gt;
	&amp;lt;Remainder&amp;gt;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;remark #15542: loop was not vectorized: inner loop was already vectorized&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;LOOP BEGIN at x.c(9,3)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;&amp;lt;Peeled&amp;gt;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;LOOP END&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;LOOP BEGIN at x.c(9,3)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15388: vectorization support: reference prod has aligned access &amp;nbsp; [ x.c(10,4) ]&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15388: vectorization support: reference p has aligned access &amp;nbsp; [ x.c(10,4) ]&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15399: vectorization support: unroll factor set to 4&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15300: LOOP WAS VECTORIZED&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15442: entire loop may be executed in remainder&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15448: unmasked aligned unit stride loads: 2&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15475: --- begin vector loop cost summary ---&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15476: scalar loop cost: 14&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15477: vector loop cost: 18.000&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15478: estimated potential speedup: 2.950&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15479: lightweight vector operations: 8&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15480: medium-overhead vector operations: 1&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15488: --- end vector loop cost summary ---&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;LOOP END&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;LOOP BEGIN at x.c(9,3)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #25460: No loop optimizations reported&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;LOOP END&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;LOOP BEGIN at x.c(9,3)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;&amp;lt;Remainder&amp;gt;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15389: vectorization support: reference prod has unaligned access &amp;nbsp; [ x.c(10,4) ]&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15388: vectorization support: reference p has aligned access &amp;nbsp; [ x.c(10,4) ]&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15381: vectorization support: unaligned access used inside loop body &amp;nbsp; [ x.c(10,4) ]&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; remark #15301: REMAINDER LOOP WAS VECTORIZED&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;LOOP END&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;LOOP BEGIN at x.c(9,3)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;&amp;lt;Remainder&amp;gt;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;LOOP END&lt;BR /&gt;
	LOOP END&lt;/P&gt;

&lt;P&gt;I have two questions about this report:&lt;/P&gt;

&lt;P&gt;1) First it says the loop was not vectorized because of dependences; then it says the loop was vectorized. Why the disagreement?&lt;/P&gt;

&lt;P&gt;2) First prod has aligned access, then prod has unaligned access. Why?&lt;/P&gt;

&lt;P&gt;Any thoughts?&lt;/P&gt;

&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2014 12:42:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042534#M4597</guid>
      <dc:creator>selmilab</dc:creator>
      <dc:date>2014-09-08T12:42:10Z</dc:date>
    </item>
    <item>
      <title>Hello Selmilab,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042535#M4598</link>
      <description>&lt;P&gt;Hello Selmilab,&lt;/P&gt;

&lt;P&gt;Could you post your question on the Intel C++ compiler forum at &lt;A href="https://software.intel.com/en-us/forums/intel-c-compiler"&gt;https://software.intel.com/en-us/forums/intel-c-compiler&lt;/A&gt;?&lt;/P&gt;

&lt;P&gt;You'll get a quicker response there.&lt;/P&gt;

&lt;P&gt;Pat&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2014 13:11:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042535#M4598</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2014-09-08T13:11:26Z</dc:date>
    </item>
    <item>
      <title>It looks like the compiler</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042536#M4599</link>
      <description>&lt;P&gt;It looks like the compiler may have generated separate versions of the inner loop: one for the case where peeling for alignment aligns both p and prod, and one for the other cases. So there is some possibility that you may not hit the vectorized loop at run time.&lt;/P&gt;

&lt;P&gt;For the remainder vectorization, it didn't try to align both p and prod. As the main vector loop takes the data 8 at a time (2 doubles per SSE register, unrolled by 4), the compiler judged it worthwhile to optimize the remainder with SIMD. Possibly it can also use the remainder loop for cases where the fully aligned version doesn't apply.&lt;/P&gt;
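
&lt;P&gt;As a rough illustration of the peel / main / remainder split, here is a scalar C model of a dot-product reduction. It is only a sketch, not what the compiler emits: the function name and the peel parameter are hypothetical, and it assumes the compiler is allowed to reassociate the floating-point sum into independent partial sums, which is how a reduction such as FF can be vectorized despite the reported dependence on itself.&lt;/P&gt;

```c
/* Scalar model of the loop versions named in the report (illustrative only).
   Assumes peel is between 0 and n inclusive. */
double dot_peel_main_remainder(const double *a, const double *b, int n, int peel)
{
    double head = 0.0;                          /* sum of the peeled iterations */
    double acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};   /* 8 independent partial sums */
    double tail = 0.0;                          /* sum of the remainder iterations */
    int i = 0;

    /* Peel: scalar iterations until the next element sits on a 16-byte boundary. */
    for (; i != peel; ++i)
        head += a[i] * b[i];

    /* Main loop: 8 elements per trip (2 doubles per SSE register, unrolled by 4). */
    int blocks = (n - i) / 8;
    for (int k = 0; k != blocks; ++k, i += 8)
        for (int j = 0; j != 8; ++j)
            acc[j] += a[i + j] * b[i + j];

    /* Remainder: the fewer-than-8 elements left over. */
    for (; i != n; ++i)
        tail += a[i] * b[i];

    /* Combine the partial sums at the end, as a vectorized reduction would. */
    double sum = head + tail;
    for (int j = 0; j != 8; ++j)
        sum += acc[j];
    return sum;
}
```

&lt;P&gt;When n minus peel is less than 8, the main loop runs zero times and all the work lands in the peel and remainder loops, which is what remark #15442 ("entire loop may be executed in remainder") warns about.&lt;/P&gt;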

&lt;P&gt;While CPUs which support AVX may be able to support both SSE4 and AVX vectorization without losing out on an unaligned operand, the SSE4.2 choice has to support the earliest SSE2 CPUs.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2014 13:11:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042536#M4599</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-09-08T13:11:56Z</dc:date>
    </item>
    <item>
      <title>Quote:Tim Prince wrote:</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042537#M4600</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Tim Prince wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;It looks like the compiler may have generated separate versions of the inner loop: one for the case where peeling for alignment aligns both p and prod, and one for the other cases. So there is some possibility that you may not hit the vectorized loop at run time.&lt;/P&gt;

&lt;P&gt;For the remainder vectorization, it didn't try to align both p and prod. As the main vector loop takes the data 8 at a time (2 doubles per SSE register, unrolled by 4), the compiler judged it worthwhile to optimize the remainder with SIMD. Possibly it can also use the remainder loop for cases where the fully aligned version doesn't apply.&lt;/P&gt;

&lt;P&gt;While CPUs which support AVX may be able to support both SSE4 and AVX vectorization without losing out on an unaligned operand, the SSE4.2 choice has to support the earliest SSE2 CPUs.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;How can I tell the compiler that the parameters point to 16-byte-aligned memory? I've found a directive, __declspec(align(16)), but it doesn't work on parameters.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2014 13:36:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042537#M4600</guid>
      <dc:creator>selmilab</dc:creator>
      <dc:date>2014-09-08T13:36:00Z</dc:date>
    </item>
    <item>
      <title>I'd suggest "#pragma vector</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042538#M4601</link>
      <description>&lt;P&gt;I'd suggest "#pragma vector aligned" right ahead of the inner for(). If you switch to AVX, it would mean 32-byte aligned. Another good alternative with Intel compilers is the __aligned designator, but other compilers will complain.&lt;/P&gt;
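
&lt;P&gt;A minimal sketch of how such hints might be attached to the function from the first post. ASSUME_ALIGNED16 and foo_aligned are hypothetical names introduced here; __assume_aligned and #pragma vector aligned are the Intel hints, and __builtin_assume_aligned is the GCC/Clang builtin, used here as an assumed stand-in so the code also builds with those compilers. The unused iQ parameter is dropped for brevity.&lt;/P&gt;

```c
/* ASSUME_ALIGNED16 is a hypothetical wrapper macro, not a compiler API. */
#if defined(__INTEL_COMPILER)
#define ASSUME_ALIGNED16(p) __assume_aligned((p), 16)
#elif defined(__GNUC__)
#define ASSUME_ALIGNED16(p) ((p) = __builtin_assume_aligned((p), 16))
#else
#define ASSUME_ALIGNED16(p) ((void)0)
#endif

/* Same computation as foo(). Assumes l is non-negative, and that prod and
   every cache[iP] really are 16-byte aligned; otherwise behavior is undefined. */
double foo_aligned(double **cache, double *prod, int l)
{
    double FF = 0.0;
    ASSUME_ALIGNED16(prod);
    for (int iP = 0; iP != l; ++iP) {
        const double *p = cache[iP];
        const double prod1 = prod[iP];
        ASSUME_ALIGNED16(p);
#if defined(__INTEL_COMPILER)
#pragma vector aligned
#endif
        for (int iP2 = 0; iP2 != l; ++iP2)
            FF += prod[iP2] * p[iP2] * prod1;
    }
    return FF;
}
```

&lt;P&gt;The pointer hint and the pragma overlap, so in practice one of the two would probably be enough; both are promises to the compiler, and the allocation has to actually deliver the alignment.&lt;/P&gt;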

&lt;P&gt;Since the compiler has already generated code for peeling and for checking alignment, those pragmas would only simplify the code (and maybe the compiler messages), giving you a slight advantage in starting the loop.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2014 15:27:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042538#M4601</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-09-08T15:27:00Z</dc:date>
    </item>
    <item>
      <title>Thank you all for your</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042539#M4602</link>
      <description>&lt;P&gt;Thank you all for your answers. You pointed me in the right direction. To solve my issues, I now allocate my arrays with _mm_malloc, enforcing 32-byte alignment. I'm also using the __assume_aligned directive in the code that uses these arrays. Everything seems to work fine.&lt;/P&gt;</description>
      <pubDate>Tue, 09 Sep 2014 13:42:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042539#M4602</guid>
      <dc:creator>selmilab</dc:creator>
      <dc:date>2014-09-09T13:42:55Z</dc:date>
    </item>
    <item>
      <title>Along the same lines, if you</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042540#M4603</link>
      <description>&lt;P&gt;Along the same lines, if you compile for OpenMP you will probably go back to generating multiple versions of the loops again, since the starting points for each OpenMP thread are not known until run-time when the OMP_NUM_THREADS variable is available to determine the distribution of data elements to threads.&lt;/P&gt;
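
&lt;P&gt;To make the starting-point issue concrete, here is a small sketch that checks whether every thread's first index falls on a vector boundary. It assumes a simple block distribution with a chunk of ceil(n / threads) iterations and an array whose first element is itself aligned; real OpenMP runtimes may distribute iterations differently, and the function name is hypothetical.&lt;/P&gt;

```c
/* Returns 1 when every thread's first index is a multiple of width
   (the number of elements per vector), 0 otherwise.
   Assumes threads and width are positive and n is non-negative. */
int all_chunk_starts_aligned(int n, int threads, int width)
{
    int chunk = (n + threads - 1) / threads;   /* ceil(n / threads) */
    for (int t = 0; t != threads; ++t) {
        int start = t * chunk;
        if (start >= n)
            break;                             /* this thread gets no work */
        if (start % width != 0)
            return 0;                          /* misaligned starting point */
    }
    return 1;
}
```

&lt;P&gt;With 1000 iterations and a width of 4, four threads start at 0, 250, 500 and 750, so the check fails at the second thread; with 1024 iterations and eight threads the chunk is 128 and every start is aligned.&lt;/P&gt;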

&lt;P&gt;(I think that the OpenMP SIMD directive(s) are intended to help the OpenMP compiler to maintain alignment even with an arbitrary number of threads, but I have not played with that (relatively new) feature yet.)&lt;/P&gt;

&lt;P&gt;Given multiple versions of the loop, you then want to know which one(s) are actually being executed. The easiest way to figure this out is probably profiling with Intel's Amplifier XE (VTune): you can click on the hot spots in the GUI to drill down to the assembly code and see which version(s) are being executed most of the time. Once you are pointed at the "hot" assembly code, it is typically pretty easy to see how the computations are vectorized and whether the memory accesses are assumed to be aligned. Understanding *why* is sometimes harder, but that is all part of the fun!&lt;/P&gt;</description>
      <pubDate>Tue, 09 Sep 2014 16:30:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042540#M4603</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-09-09T16:30:37Z</dc:date>
    </item>
    <item>
      <title>#pragma omp simd aligned (the</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042541#M4604</link>
      <description>&lt;P&gt;#pragma omp simd aligned (the OpenMP 4 vectorization pragma) offers some functionality equivalent to Intel's proprietary __assume_aligned.&lt;/P&gt;
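
&lt;P&gt;A minimal sketch of that directive on a dot-product loop. The function name is hypothetical, the 16-byte figure assumes SSE-width vectors, and the pragma only takes effect when the compiler is run with OpenMP (or OpenMP SIMD) support enabled; otherwise it is ignored and the loop runs as plain C.&lt;/P&gt;

```c
/* Dot product with the OpenMP 4 SIMD directive (illustrative name).
   The caller must guarantee that a and b are 16-byte aligned. */
double dot_simd(const double *a, const double *b, int n)
{
    double sum = 0.0;
#pragma omp simd aligned(a, b : 16) reduction(+ : sum)
    for (int i = 0; n > i; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

&lt;P&gt;The reduction clause states explicitly that sum may be accumulated in independent lanes, and the aligned clause carries the same promise as __assume_aligned in a form other OpenMP 4 compilers also accept.&lt;/P&gt;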

&lt;P&gt;As John mentioned, the situation gets complicated with OpenMP parallel (threading). As far as I know, in practice it's not possible to assert alignment unless the product of the number of threads and the hardware SIMD vector width matches the total loop count. This is one of the reasons why OpenMP parallel is more effective with an outer parallel loop and an inner vector loop than in the case where a single loop is to be compiled as both threaded and SIMD parallel (a situation supported by Intel compilers with #pragma omp parallel for simd). If any thread has to take the time to process misalignment, all threads may as well do so.&lt;/P&gt;

&lt;P&gt;Even with Intel compilers, there is a question about the degree of support for multiple OpenMP 4 clauses.&amp;nbsp;&amp;nbsp; gcc seems less likely to produce crashes when more clauses are added, but also less likely to do anything useful with additional clauses.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2014 12:16:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042541#M4604</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-09-10T12:16:00Z</dc:date>
    </item>
    <item>
      <title>Thank you both for pointing</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042542#M4605</link>
      <description>&lt;P&gt;Thank you both for pointing out these issues with OpenMP, which I had never considered.&lt;/P&gt;

&lt;P&gt;The code snippet is at the end of a long function call tree. At the top of this tree there is a for loop that is parallelized via the usual #pragma omp parallel for, but the two loops of the snippet are supposed to be executed sequentially by each thread.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2014 13:12:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Loop-vectorization-and-how-to-read-optimization-report/m-p/1042542#M4605</guid>
      <dc:creator>selmilab</dc:creator>
      <dc:date>2014-09-10T13:12:05Z</dc:date>
    </item>
  </channel>
</rss>

