Software Archive
Read-only legacy content

Xeon Phi Segmentation Fault Simple Offload

Dave_O_
Beginner

I have this simple matrix multiply for offload on the Xeon Phi, but I get an offload error (SIGSEGV) when I run the program below:

#include <stdlib.h>
#include <math.h>

void main()
{
    double *a, *b, *c; 
    int i,j,k, ok, n=100;

    // allocated memory on the heap aligned to 64 byte boundary 
    ok = posix_memalign((void**)&a, 64, n*n*sizeof(double)); 
    ok |= posix_memalign((void**)&b, 64, n*n*sizeof(double)); 
    ok |= posix_memalign((void**)&c, 64, n*n*sizeof(double));


    // initialize matrices 
    for(i=0; i<n; i++)
    {
        a[i] = (int) rand();
        b[i] = (int) rand();
        c[i] = 0.0;
    }
    
    //offload code 
    #pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n)) 
    
    //parallelize via OpenMP on MIC 
    #pragma omp parallel for 
    for( i = 0; i < n; i++ ) 
        for( k = 0; k < n; k++ ) 
            #pragma vector aligned 
            #pragma ivdep 
            for( j = 0; j < n; j++ ) 
                //c = c + a*b
                    c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];

}

What am I doing wrong?

I read a previous post suggesting there might be a known bug in the release; could that be the cause?

 

 

6 Replies
Dave_O_
Beginner

Here is the program output:

[Offload] [MIC 0] [File]            matmul_offload.cpp
[Offload] [MIC 0] [Line]            19
[Offload] [MIC 0] [Tag]             Tag 0
offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)

jimdempseyatthecove
Honored Contributor III

 >> c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];

n = 100
When i=1 and j=0 (the start of the inner loop), c[i*n+j] is not 64-byte aligned, even though #pragma vector aligned asserts that it is. Do not make false declarations.
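
To check this concretely, here is a minimal standalone sketch (assuming the same n = 100 and the same 64-byte posix_memalign allocation as in the posted code) that prints how far each row start sits from a 64-byte boundary:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main()
{
    const int n = 100;
    double *c;

    // 64-byte aligned allocation, exactly as in the question
    if (posix_memalign((void**)&c, 64, n*n*sizeof(double)))
        return 1;

    // row 0 starts at the aligned base address, but row 1 starts
    // n*8 = 800 bytes later, and 800 % 64 = 32
    printf("row 0 offset from 64-byte boundary: %lu\n",
           (unsigned long)((uintptr_t)&c[0*n] % 64));   // prints 0
    printf("row 1 offset from 64-byte boundary: %lu\n",
           (unsigned long)((uintptr_t)&c[1*n] % 64));   // prints 32

    free(c);
    return 0;
}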

Jim Dempsey

Andrey_Vladimirov
New Contributor III

If you work with "#pragma vector aligned" on Xeon Phi, then, in addition to using an aligned allocator, you have to pad the inner loop dimension (in your case, "n") to a multiple of 8 in double precision or a multiple of 16 in single precision. Otherwise, as Jim Dempsey explained above, your declaration becomes false for i>0.
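
In double precision with n = 100, each row is 100*8 = 800 bytes, and 800 is not a multiple of 64; padding the row length to 104 gives 104*8 = 832 = 13*64 bytes, so every row start stays on a 64-byte boundary. A minimal sketch of that padding arithmetic (the name nPadded is only illustrative here):

#include <assert.h>

int main()
{
    const int n = 100;
    // round the row length up to the next multiple of 8 doubles
    // (8 doubles * 8 bytes = 64 bytes, one cache line / MIC vector width)
    const int nPadded = (n % 8 == 0) ? n : n + (8 - n % 8);

    assert(nPadded == 104);
    assert((n       * sizeof(double)) % 64 != 0);   // 800-byte rows drift off alignment
    assert((nPadded * sizeof(double)) % 64 == 0);   // 832-byte rows stay 64-byte aligned
    return 0;
}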

Dave_O_
Beginner

Thanks Andrey - it's my first offload code for Xeon Phi. I usually compile code for native runs.

Could you kindly give me an example, or point me to a resource?

Much thanks

Dave

Andrey_Vladimirov
New Contributor III

Hi Dave,

In order to fix your code, you can do something like the code below.

A nice paper about this is http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization. A comprehensive resource with practical examples that addresses vectorization, data alignment and optimization on Xeon Phi in general is http://www.colfax-intl.com/nd/xeonphi/book.aspx. Of course, asking me about resources is like asking Ronald McDonald to point out a good burger place in town.

Andrey

 

#include <stdlib.h>
#include <math.h>

void main()
{
    double *a, *b, *c;
    int i,j,k, ok, n=100;
    int nPadded = ( n%8 == 0 ? n : n + (8-n%8) );

    // allocate memory on the heap aligned to a 64-byte boundary
    ok = posix_memalign((void**)&a, 64, n*nPadded*sizeof(double));
    ok |= posix_memalign((void**)&b, 64, n*nPadded*sizeof(double));
    ok |= posix_memalign((void**)&c, 64, n*nPadded*sizeof(double));

    // initialize matrices
    for(i=0; i<n; i++)
    {
        a[i] = (int) rand();
        b[i] = (int) rand();
        c[i] = 0.0;
    }

    //offload code
    #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))

    //parallelize via OpenMP on MIC; j and k must be private to each thread
    #pragma omp parallel for private(j,k)
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ )
                //c = c + a*b;
                c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];

}

 

Juan_G_
Beginner

I am using this code to see whether the Xeon Phi gives better performance than the Xeon alone. I commented out the directives #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded)), #pragma vector aligned and #pragma ivdep to run on the Xeon only, and uncommented them to run on the Xeon Phi, but the Xeon-only performance is better than the Xeon Phi. To compile I use icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul.mic -mmic for the Xeon Phi and icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul for the Xeon only. Could you please help me with a simple example where, using parallelization and vectorization, Xeon Phi performance is better than Xeon-only performance?
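
A minimal, self-contained timing sketch for that kind of comparison is below (the size n = 2048, the helper name multiply, and the private(j,k) clause are illustrative assumptions, not from this thread): it times only the multiply loop with omp_get_wtime() and runs it once untimed first, so that OpenMP thread-pool startup is not included in the measurement.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static void multiply(const double *a, const double *b, double *c,
                     int n, int nPadded)
{
    int i, j, k;
    #pragma omp parallel for private(j,k)
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            #pragma vector aligned
            #pragma ivdep
            for (j = 0; j < n; j++)
                c[i*nPadded + j] += a[i*nPadded + k] * b[k*nPadded + j];
}

int main()
{
    const int n = 2048;                 /* illustrative size, much larger than 100 */
    const int nPadded = (n % 8 == 0) ? n : n + (8 - n % 8);
    double *a, *b, *c, t;
    int i;

    if (posix_memalign((void**)&a, 64, (size_t)n*nPadded*sizeof(double)) ||
        posix_memalign((void**)&b, 64, (size_t)n*nPadded*sizeof(double)) ||
        posix_memalign((void**)&c, 64, (size_t)n*nPadded*sizeof(double)))
        return 1;

    for (i = 0; i < n*nPadded; i++) {
        a[i] = rand() / (double)RAND_MAX;
        b[i] = rand() / (double)RAND_MAX;
        c[i] = 0.0;
    }

    multiply(a, b, c, n, nPadded);      /* warm-up: creates the OpenMP thread pool */
    t = omp_get_wtime();
    multiply(a, b, c, n, nPadded);      /* timed run */
    t = omp_get_wtime() - t;
    printf("n = %d: %.3f s, %.2f GFLOP/s\n", n, t, 2.0*n*n*n / t / 1e9);

    free(a); free(b); free(c);
    return 0;
}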
