Xeon Phi Segmentation Fault Simple Offload

Dave_O_ — Mon, 07 Apr 2014 04:09:43 GMT

I have this simple matrix multiply for offload on Xeon Phi, but I get an offload error (SIGSEGV) when I run the program below:

#include <stdlib.h>
#include <math.h>

void main()
{
double *a, *b, *c;
int i,j,k, ok, n=100;

   // allocated memory on the heap aligned to 64 byte boundary
   ok = posix_memalign((void**)&a, 64, n*n*sizeof(double));
   ok |= posix_memalign((void**)&b, 64, n*n*sizeof(double));
   ok |= posix_memalign((void**)&c, 64, n*n*sizeof(double));

   // initialize matrices
   for(i=0; i<n; i++)
   {
       a[i] = (int) rand();
       b[i] = (int) rand();
       c[i] = 0.0;
   }

   //offload code
   #pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))

   //parallelize via OpenMP on MIC
   #pragma omp parallel for
   for( i = 0; i < n; i++ )
       for( k = 0; k < n; k++ )
           #pragma vector aligned
           #pragma ivdep
           for( j = 0; j < n; j++ )
               //c = c + a*b;
                   c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];

}

What am I doing wrong?

I read in a previous post that there might be a known bug in the release.

Dave_O_ — Mon, 07 Apr 2014 04:11:10 GMT

Here is the program output:

[Offload] [MIC 0] [File] matmul_offload.cpp
[Offload] [MIC 0] [Line] 19
[Offload] [MIC 0] [Tag] Tag 0
offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)

jimdempseyatthecove — Mon, 07 Apr 2014 12:45:29 GMT

>> c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];

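With n = 100, when i=1 and j=0 (the start of the inner loop), c[i*n+j] sits 800 bytes past the 64-byte-aligned base, and 800 is not a multiple of 64. So it is not aligned, contrary to what you stated with #pragma vector aligned. Do not make false declarations.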

Jim Dempsey
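
The alignment arithmetic in this reply can be checked with a minimal sketch. The helper names `row_byte_offset` and `row_is_aligned` are hypothetical and not part of the thread's code, and the sketch assumes 8-byte doubles and a base pointer that is itself aligned (as posix_memalign provides). With 100 doubles per row, row 1 starts 800 bytes in, and 800 mod 64 = 32, so that row does not start on a 64-byte boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset of element [i][0] from the start of a
   row-major matrix that stores row_len doubles per row. */
static size_t row_byte_offset(size_t i, size_t row_len)
{
    return i * row_len * sizeof(double);
}

/* Returns 1 if row i starts on an `alignment`-byte boundary, assuming the
   base pointer of the matrix is itself aligned to `alignment` bytes. */
static int row_is_aligned(size_t i, size_t row_len, size_t alignment)
{
    assert(alignment > 0);
    return row_byte_offset(i, row_len) % alignment == 0;
}
```

Calling `row_is_aligned(1, 100, 64)` reports a misaligned row, while the same call with a row length of 104 reports an aligned one. Under these assumptions only the even-numbered rows of the unpadded 100-column matrix happen to be aligned, which is why the declaration is false for the loop as a whole.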

Andrey_Vladimirov — Mon, 07 Apr 2014 15:23:40 GMT

If you work with "#pragma vector aligned" on Xeon Phi, then, in addition to using an aligned allocator, you have to pad the inner loop dimension (in your case, "n") to a multiple of 8 in double precision or a multiple of 16 in single precision. Otherwise, as Jim Dempsey explained above, your declaration becomes false for i>0.
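
The padding rule above amounts to rounding the row length up to a whole number of 64-byte vectors. A minimal sketch follows, assuming a 64-byte vector width; the helper name `padded_dim` is hypothetical and only illustrates the rounding, it is not an Intel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of the number of
   elements that fit in one `alignment`-byte vector (64 bytes is assumed
   for Xeon Phi in the usage below). */
static size_t padded_dim(size_t n, size_t elem_size, size_t alignment)
{
    assert(elem_size > 0 && alignment % elem_size == 0);
    size_t per_vector = alignment / elem_size; /* 8 for double, 16 for float */
    return (n + per_vector - 1) / per_vector * per_vector;
}
```

For n = 100 this gives a padded row length of 104 in double precision and 112 in single precision. The padded length is then used both as the allocation stride and as the row stride in every index expression.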

Dave_O_ — Mon, 07 Apr 2014 16:40:36 GMT

Thanks Andrey - it's my first offload code for Xeon Phi. I usually compile code for native runs.

Could you kindly give me an example, or point me to a resource?

Much thanks

Dave

Andrey_Vladimirov — Mon, 07 Apr 2014 19:13:49 GMT

Hi Dave,

In order to fix your code, you can do something like the code below.

A nice paper about it is http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization . A comprehensive resource with practical examples that addresses vectorization, data alignment and optimization on Xeon Phi in general is http://www.colfax-intl.com/nd/xeonphi/book.aspx . Of course, asking me about resources is like asking Ronald McDonald to point out a good burger place in town.

Andrey

#include <stdlib.h>
#include <math.h>

int main()
{
    double *a, *b, *c;
    int i, j, k, ok, n = 100;
    // pad the row length to a multiple of 8 doubles (64 bytes)
    int nPadded = ( n%8 == 0 ? n : n + (8-n%8) );

    // allocate memory on the heap aligned to a 64-byte boundary
    ok  = posix_memalign((void**)&a, 64, n*nPadded*sizeof(double));
    ok |= posix_memalign((void**)&b, 64, n*nPadded*sizeof(double));
    ok |= posix_memalign((void**)&c, 64, n*nPadded*sizeof(double));
    if (ok != 0) return 1;   // allocation failed

    // initialize matrices, including the padding elements
    for(i=0; i<n*nPadded; i++)
    {
        a[i] = (int) rand();
        b[i] = (int) rand();
        c[i] = 0.0;
    }

    //offload code
    #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))
    //parallelize via OpenMP on MIC; j and k must be private to each thread
    #pragma omp parallel for private(j,k)
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ )
                c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];

    free(a);
    free(b);
    free(c);
    return 0;
}

Juan_G_ — Sat, 13 Feb 2016 12:28:00 GMT

I am using that code to find out whether Xeon Phi performs better than Xeon alone. To run on the Xeon only, I commented out #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded)), #pragma vector aligned and #pragma ivdep; to run on the Xeon Phi, I uncommented them. However, performance on the Xeon alone is better than on the Xeon Phi. To compile, I use icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul.mic -mmic for the Xeon Phi and icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul for the Xeon only. Could you please help me with a simple example in which, using parallelization and vectorization, the Xeon Phi performs better than the Xeon alone?