Dave_O_
Beginner

Xeon Phi Segmentation Fault Simple Offload


I have this simple matrix multiply for offload on the Phi, but I get an offload error (SIGSEGV) when I run the program below:

#include <stdlib.h>
#include <math.h>

int main()
{
    double *a, *b, *c; 
    int i,j,k, ok, n=100;

    // allocate memory on the heap, aligned to a 64-byte boundary
    ok = posix_memalign((void**)&a, 64, n*n*sizeof(double)); 
    ok |= posix_memalign((void**)&b, 64, n*n*sizeof(double)); 
    ok |= posix_memalign((void**)&c, 64, n*n*sizeof(double));


    // initialize matrices 
    for(i=0; i<n*n; i++)
    {
        a[i] = (double) rand();
        b[i] = (double) rand();
        c[i] = 0.0;
    }
    
    //offload code 
    #pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n)) 
    
    //parallelize via OpenMP on MIC 
    #pragma omp parallel for 
    for( i = 0; i < n; i++ ) 
        for( k = 0; k < n; k++ ) 
            #pragma vector aligned 
            #pragma ivdep 
            for( j = 0; j < n; j++ ) 
                // c[i][j] += a[i][k] * b[k][j]
                c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];

}

What am I doing wrong?

I read in a previous post that there might be a known bug in the release. Could that be the cause?


6 Replies
Dave_O_
Beginner

Here is the program output:

[Offload] [MIC 0] [File]            matmul_offload.cpp
[Offload] [MIC 0] [Line]            19
[Offload] [MIC 0] [Tag]             Tag 0
offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)

jimdempseyatthecove
Black Belt

 >> c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];

n = 100
When i = 1 and j = 0 (the start of the inner loop), &c[i*n+j] sits 800 bytes past the 64-byte-aligned base, and 800 mod 64 = 32, so it is not aligned as you have asserted with #pragma vector aligned. Do not make false declarations.

Jim Dempsey

Andrey_Vladimirov
New Contributor III

If you work with "#pragma vector aligned" on Xeon Phi, then, in addition to using an aligned allocator, you have to pad the inner loop dimension (in your case, "n") to a multiple of 8 in double precision or a multiple of 16 in single precision. Otherwise, as Jim Dempsey explained above, your declaration becomes false for i>0.

Dave_O_
Beginner

Thanks, Andrey - it's my first offload code for Xeon Phi. I usually compile code for native runs.

Could you kindly give me an example, or point me to a resource?

Much thanks

Dave

Andrey_Vladimirov
New Contributor III

Hi Dave,

In order to fix your code, you can do something like the example below.

A nice paper about it is http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization . A comprehensive resource with practical examples that addresses vectorization, data alignment, and optimization on Xeon Phi in general is http://www.colfax-intl.com/nd/xeonphi/book.aspx . Of course, asking me about resources is like asking Ronald McDonald to point out a good burger place in town.

Andrey

 

#include <stdlib.h>
#include <math.h>

int main()
{
    double *a, *b, *c;
    int i, j, k, ok, n = 100;
    // pad the row length up to a multiple of 8 doubles (64 bytes)
    int nPadded = ( n%8 == 0 ? n : n + (8 - n%8) );

    // allocate memory on the heap, aligned to a 64-byte boundary
    ok  = posix_memalign((void**)&a, 64, n*nPadded*sizeof(double));
    ok |= posix_memalign((void**)&b, 64, n*nPadded*sizeof(double));
    ok |= posix_memalign((void**)&c, 64, n*nPadded*sizeof(double));

    // initialize matrices (padding elements included, so the
    // transferred buffers are fully initialized)
    for(i=0; i<n*nPadded; i++)
    {
        a[i] = (double) rand();
        b[i] = (double) rand();
        c[i] = 0.0;
    }

    // offload code
    #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))
    // parallelize via OpenMP on MIC
    #pragma omp parallel for
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ )
                // c[i][j] += a[i][k] * b[k][j]
                c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];
}

 

Juan_G_
Beginner

I am using that code to find out whether Xeon Phi has better performance than a Xeon alone. To run on the Xeon only, I commented out #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded)), #pragma vector aligned, and #pragma ivdep, and uncommented them to run on Xeon Phi. However, performance on the Xeon alone is better than on Xeon Phi. To compile I use icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul.mic -mmic for Xeon Phi and icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul for the Xeon alone. Could you please help me with a simple example where, using parallelization and vectorization, Xeon Phi performance beats the Xeon alone?
