<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Thanks Andrey - its my first in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972893#M24397</link>
<description>&lt;P&gt;Thanks, Andrey. It's my first offload code for Xeon Phi; I usually compile code for native runs.&lt;/P&gt;

&lt;P&gt;Could you kindly give me an example, or point me to a resource?&lt;/P&gt;

&lt;P&gt;Many thanks&lt;/P&gt;

&lt;P&gt;Dave&lt;/P&gt;</description>
    <pubDate>Mon, 07 Apr 2014 16:40:36 GMT</pubDate>
    <dc:creator>Dave_O_</dc:creator>
    <dc:date>2014-04-07T16:40:36Z</dc:date>
    <item>
      <title>Xeon Phi Segmentation Fault Simple Offload</title>
      <link>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972889#M24393</link>
      <description>&lt;P&gt;I have this simple matrix multiply for offload on Xeon Phi, but I get an offload error (SIGSEGV) when I run the program below:&lt;/P&gt;

&lt;P&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;BR /&gt;
	#include &amp;lt;math.h&amp;gt;&lt;/P&gt;

&lt;P&gt;void main()&lt;BR /&gt;
	{&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;double *a, *b, *c;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;int i,j,k, ok, n=100;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;// allocated memory on the heap aligned to 64 byte boundary&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;ok = posix_memalign((void**)&amp;amp;a, 64, n*n*sizeof(double));&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;ok |= posix_memalign((void**)&amp;amp;b, 64, n*n*sizeof(double));&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;ok |= posix_memalign((void**)&amp;amp;c, 64, n*n*sizeof(double));&lt;/P&gt;

&lt;P&gt;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;// initialize matrices&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;for(i=0; i&amp;lt;n; i++)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;a[i] = (int) rand();&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;b[i] = (int) rand();&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;c[i] = 0.0;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;//offload code&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;//parallelize via OpenMP on MIC&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;#pragma omp parallel for&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;for( i = 0; i &amp;lt; n; i++ )&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;for( k = 0; k &amp;lt; n; k++ )&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;#pragma vector aligned&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;#pragma ivdep&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;for( j = 0; j &amp;lt; n; j++ )&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;//c[i][j] = c[i][j] + a[i][k]*b[k][j];&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];&lt;/P&gt;

&lt;P&gt;}&lt;/P&gt;

&lt;P&gt;What am I doing wrong?&lt;/P&gt;

&lt;P&gt;I read in a previous post that there might be a known bug in this release?&lt;/P&gt;</description>
      <pubDate>Mon, 07 Apr 2014 04:09:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972889#M24393</guid>
      <dc:creator>Dave_O_</dc:creator>
      <dc:date>2014-04-07T04:09:43Z</dc:date>
    </item>
    <item>
      <title>Here is the program output:</title>
      <link>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972890#M24394</link>
      <description>&lt;P&gt;Here is the program output:&lt;/P&gt;

&lt;P&gt;[Offload] [MIC 0] [File] &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;matmul_offload.cpp&lt;BR /&gt;
	[Offload] [MIC 0] [Line] &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;19&lt;BR /&gt;
	[Offload] [MIC 0] [Tag] &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Tag 0&lt;BR /&gt;
	offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)&lt;/P&gt;</description>
      <pubDate>Mon, 07 Apr 2014 04:11:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972890#M24394</guid>
      <dc:creator>Dave_O_</dc:creator>
      <dc:date>2014-04-07T04:11:10Z</dc:date>
    </item>
    <item>
      <title> &gt;&gt; c[i*n+j] = c[i*n+j] + a[i</title>
      <link>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972891#M24395</link>
      <description>&lt;P&gt;&amp;nbsp;&amp;gt;&amp;gt; c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];&lt;/P&gt;

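&lt;P&gt;n = 100&lt;BR /&gt;
	When i=1 and&amp;nbsp;j=0 (the start of the inner loop), c[i*n+j] is c[100], which sits 800 bytes past the 64-byte-aligned base. Since 800 is not a multiple of 64, that element is not aligned, contrary to what you have stated with #pragma vector aligned. Do not make false declarations.&lt;/P&gt;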

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 07 Apr 2014 12:45:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972891#M24395</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-04-07T12:45:29Z</dc:date>
    </item>
    <item>
      <title>If you work with "#pragma</title>
      <link>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972892#M24396</link>
      <description>&lt;P&gt;If you work with "#pragma vector aligned" on Xeon Phi, then, in addition to using an aligned allocator, you have to pad the inner loop dimension (in your case, "n") to a multiple of 8 in double precision or a multiple of 16 in single precision. Otherwise, as Jim Dempsey explained above, your declaration becomes false for i&amp;gt;0.&lt;/P&gt;</description>
      <pubDate>Mon, 07 Apr 2014 15:23:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972892#M24396</guid>
      <dc:creator>Andrey_Vladimirov</dc:creator>
      <dc:date>2014-04-07T15:23:40Z</dc:date>
    </item>
    <item>
      <title>Thanks Andrey - its my first</title>
      <link>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972893#M24397</link>
      <description>&lt;P&gt;Thanks, Andrey. It's my first offload code for Xeon Phi; I usually compile code for native runs.&lt;/P&gt;

&lt;P&gt;Could you kindly give me an example, or point me to a resource?&lt;/P&gt;

&lt;P&gt;Many thanks&lt;/P&gt;

&lt;P&gt;Dave&lt;/P&gt;</description>
      <pubDate>Mon, 07 Apr 2014 16:40:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972893#M24397</guid>
      <dc:creator>Dave_O_</dc:creator>
      <dc:date>2014-04-07T16:40:36Z</dc:date>
    </item>
    <item>
      <title>Hi Dave,</title>
      <link>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972894#M24398</link>
      <description>&lt;P&gt;Hi Dave,&lt;/P&gt;

&lt;P&gt;To fix your code, you can do something like the listing below.&lt;/P&gt;

&lt;P&gt;A good article on the topic is &lt;A href="http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization" target="_blank"&gt;http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization&lt;/A&gt;. A comprehensive resource with practical examples that covers vectorization, data alignment and optimization on Xeon Phi in general is &lt;A href="http://www.colfax-intl.com/nd/xeonphi/book.aspx" target="_blank"&gt;http://www.colfax-intl.com/nd/xeonphi/book.aspx&lt;/A&gt;. Of course, asking me about resources is like asking Ronald McDonald to point out a good burger place in town.&lt;/P&gt;

&lt;P&gt;Andrey&lt;/P&gt;

&lt;P&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;/P&gt;

&lt;P&gt;#include &amp;lt;math.h&amp;gt;&lt;/P&gt;

&lt;P&gt;int main()&lt;/P&gt;

&lt;P&gt;{&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; double *a, *b, *c;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; int i,j,k, ok, n=100;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; int nPadded = ( n%8 == 0 ? n : n + (8-n%8) );&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // allocated memory on the heap aligned to 64 byte boundary&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ok = posix_memalign((void**)&amp;amp;a, 64, n*nPadded*sizeof(double));&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ok |= posix_memalign((void**)&amp;amp;b, 64, n*nPadded*sizeof(double));&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ok |= posix_memalign((void**)&amp;amp;c, 64, n*nPadded*sizeof(double));&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // initialize every element of the padded matrices&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for(i=0; i&amp;lt;n*nPadded; i++)&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; a[i] = (int) rand();&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; b[i] = (int) rand();&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c[i] = 0.0;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; //offload code&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; //parallelize via OpenMP on MIC; j and k are declared outside the loop, so make them private to each thread&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #pragma omp parallel for private(j,k)&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for( i = 0; i &amp;lt; n; i++ )&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for( k = 0; k &amp;lt; n; k++ )&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #pragma vector aligned&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #pragma ivdep&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for( j = 0; j &amp;lt; n; j++ )&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; //c[i][j] = c[i][j] + a[i][k]*b[k][j];&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];&lt;/P&gt;

&lt;P&gt;}&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;
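&lt;P&gt;As a rough check of the padding arithmetic used in the listing above, here is a minimal sketch. It assumes 8-byte doubles, 64-byte vector alignment and a base pointer that is itself 64-byte aligned; the helper names pad_to_multiple_of_8 and row_start_is_aligned are hypothetical and exist only for illustration.&lt;/P&gt;

```c
/* Round a row length up to the next multiple of 8 elements.
   Assumption: 8 doubles of 8 bytes each fill one 64-byte vector. */
int pad_to_multiple_of_8(int n)
{
    return (n + 7) / 8 * 8;
}

/* Return 1 if row i of a row-major matrix of doubles starts on a
   64-byte boundary, assuming the base pointer is 64-byte aligned. */
int row_start_is_aligned(int i, int row_length)
{
    unsigned long offset_bytes =
        (unsigned long)i * (unsigned long)row_length * sizeof(double);
    return offset_bytes % 64 == 0;
}
```

&lt;P&gt;With n = 100, pad_to_multiple_of_8 gives 104. Row 1 then starts at byte 832, which is a multiple of 64, whereas with the unpadded row length it starts at byte 800, which is not.&lt;/P&gt;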

</description>
      <pubDate>Mon, 07 Apr 2014 19:13:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972894#M24398</guid>
      <dc:creator>Andrey_Vladimirov</dc:creator>
      <dc:date>2014-04-07T19:13:49Z</dc:date>
    </item>
    <item>
      <title>I am using that code to know</title>
      <link>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972895#M24399</link>
      <description>&lt;P&gt;I am using that code to find out whether Xeon Phi performs better than a Xeon-only run. For the Xeon-only run I commented out #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded)), #pragma vector aligned and #pragma ivdep; for the Xeon Phi run I uncommented them. However, performance on Xeon alone is better than on Xeon Phi. To compile, I use icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul.mic -mmic for Xeon Phi and icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul for Xeon only. Could you please help me with a simple example in which, using parallelization and vectorization, Xeon Phi outperforms Xeon alone?&lt;/P&gt;</description>
      <pubDate>Sat, 13 Feb 2016 12:28:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Xeon-Phi-Segmentation-Fault-Simple-Offload/m-p/972895#M24399</guid>
      <dc:creator>Juan_G_</dc:creator>
      <dc:date>2016-02-13T12:28:00Z</dc:date>
    </item>
  </channel>
</rss>

