
How to compile a program with intrinsics on MIC

Feng_L_
Beginner
790 Views

I tried to compile a program that uses intrinsics such as "_mm512_load_ps(...)" for MIC with the icc compiler, but I ran into a problem with the code below.

……

#pragma offload_attribute (push,target(mic))
#include "immintrin.h"
#pragma offload_attribute (pop)

……

#pragma offload_attribute (push,target(mic))

......

_d_wt=_mm512_load_ps (&Random_matrix);
_Xk=_mm512_add_ps(_mm512_set_1to16_ps(X),_d_wt);

......

#pragma offload_attribute (pop)

……

When I tried to compile the code, I got the following errors:

ThetaScheme.o: In function `current_solution':
ThetaScheme.c:(.text+0xd86): undefined reference to `_mm512_load_ps'
ThetaScheme.c:(.text+0xda1): undefined reference to `_mm512_set1_ps'

......

Should I link a special library or do some extra work to compile this kind of program?

Thank you !

7 Replies
Kevin_D_Intel
Employee

The undefined references occur when linking the host-side code, because these intrinsics do not exist for the host. Conditionalize their use in the offload code with:

#ifdef __MIC__

<Phi intrinsic code here>

#else

<Host side equivalent code here>

#endif

If you have not already, look at the C sample under:

/opt/intel/composer_xe_2013/Samples/en_US/C++/mic_samples/intro_sampleC/sampleC006.c
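
As a rough illustration of that guard, the sketch below adds two blocks of 16 floats. `add16` is a hypothetical name and is not taken from the Intel sample. The `__MIC__` branch assumes 64-byte-aligned pointers and uses only intrinsics that already appear in this thread. The `#else` branch is a plain scalar loop, so the host build never references an `_mm512_*` symbol. Offload decorations are omitted for brevity. In an offload build the function would still sit inside the `#pragma offload_attribute (push,target(mic))` region, as in the original post.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __MIC__
#include <immintrin.h>   /* only needed for the coprocessor build */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for 16 floats. */
void add16(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#ifdef __MIC__
    /* Coprocessor path: all three pointers must be 64-byte aligned. */
    __m512 va = _mm512_load_ps(a);
    __m512 vb = _mm512_load_ps(b);
    _mm512_store_ps(out, _mm512_add_ps(va, vb));
#else
    /* Host path: scalar equivalent, no 512-bit intrinsics referenced. */
    int i;
    for (i = 0; i < 16; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Because both branches compute the same result, the function can be called from host-only code and from offloaded code without changing the call site.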

 

Feng_L_
Beginner

Thanks a lot! The program compiled successfully.

Feng_L_
Beginner

Thanks for the reply above, but I have a new problem. When I use "_mm512_store_ps(...)" (or other functions like it) to store data to an array, I get an error when I run the program.
      
      offload error: process on the device 0 was terminated by signal 11
      
      And the code:
      
      
      #pragma offload_attribute (push,target(mic))
      void function(...)
      {
      static float __attribute__((target(mic),aligned(64))) d_wt[BLOCKSIZE*THREADSNUM] ;
      
      ... ...
      
      #pragma omp parallel for  private(... ,start,_d_wt,...) schedule(dynamic,1)
      {
      ...
      _d_wt=_mm512_loadunpacklo_ps (_d_wt, (void*)(&Random_matrix)  );
      _d_wt=_mm512_loadunpackhi_ps (_d_wt, (void*)(&Random_matrix)  );
      //_d_wt=_mm512_load_ps (&Random_matrix);
      
      _mm512_packstorelo_ps((void*)(&d_wt[start])   , _d_wt );
      _mm512_packstorehi_ps((void*)(&d_wt[start+16]), _d_wt );
      
      // ------>   start= 16 * omp_get_thread_num();
      
      
      //_mm512_store_ps(d_wt,_d_wt);
      ...
      }
      ... ...
      
      }
      
      I wonder why this happens. Thanks.

Kevin_D_Intel
Employee

This often indicates an access outside available memory, which could be due to insufficient allocation for variables used in the offloaded code. Inspect those variables to ensure that sufficient memory is allocated and that they are decorated for access in offloaded code. Because it is not obvious from the code snippet, check that Random_matrix is declared accordingly for access within offloaded code.
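
One way to act on this advice is to assert, before each 16-float store, that the target range lies inside the buffer. The helpers below are a hypothetical sketch: the names `store_in_bounds` and `check_store` and their parameters do not come from the original code. In the posted loop they would be called with `start` and the element count of `d_wt`, that is `BLOCKSIZE*THREADSNUM`.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16  /* floats per 512-bit vector */

/* Returns 1 if a store of LANES floats beginning at index `start`
   stays inside a buffer of `len` floats, 0 otherwise. */
int store_in_bounds(size_t start, size_t len)
{
    return start <= len && len - start >= LANES;
}

/* Example guard to place just before the store in the loop body. */
void check_store(size_t start, size_t len)
{
    assert(store_in_bounds(start, len));
}
```

A failing assertion points at the offending index directly, which is usually easier to act on than a bare signal 11 from the device.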

Feng_L_
Beginner

Thank you. I have checked Random_matrix and it seems OK. I wrote a new program to show the problem. There are two cases. In case two, I use the usual scalar method to calculate, and there is no error. In case one, I use "_mm512_store_ps" to store data to the array, but I get the same error: "offload error: process on the device 0 was terminated by signal 11". I also tried "_mm512_store_ps((void*)(&B),_C);" instead of "_mm512_store_ps((void*)(&X),_C);" to store the data to array B, but I still got the same error. The full code is pasted below.

#include <stdlib.h>
#include <stdio.h>

#define SIZE 1024

#define CASE1
//#define CASE2

#pragma offload_attribute (push,target(mic))
#include "immintrin.h"
void calculate(float* A,float* B)
{
        static float __attribute__((target(mic),aligned(64))) X[SIZE] ;

         int k,i;

        __m512 _A;
        __m512 _B;
        __m512 _C;


      #ifdef __MIC__
        for(k=0;k<SIZE;k+=16)
        {

                _A=_mm512_load_ps((void*)(&A));
                _B=_mm512_load_ps((void*)(&B));

                _C=_mm512_add_ps(_A,_B);

                #ifdef CASE1
                _mm512_store_ps((void*)(&X),_C);
                #endif

                for(i=k;i<k+16;i++)
                {
                        #ifdef CASE2
                        X[i]=A[i]+B[i];
                        #endif

                        printf("%2.f ",X[i]);
                }
                printf("\n");
        }
        #endif

}

#pragma offload_attribute (pop)

int  main()
{
        int i;
        float *A;
        float *B;
        A=(float*)malloc(sizeof(float)*SIZE);
        B=(float*)malloc(sizeof(float)*SIZE);
        for(i=0;i<SIZE;i++)
        {
                A[i]=i;
                B[i]=i;
        }

        #pragma offload_transfer target(mic:0)\
        in(A:length(SIZE) alloc_if(1) free_if(0))\
        in(B:length(SIZE) alloc_if(1) free_if(0))\
                 
        #pragma offload target(mic:0) \
        in(A:length(0) alloc_if(0) free_if(0))\
        in(B:length(0) alloc_if(0) free_if(0))
        calculate(A,B);

        return 1;
}

Kevin_D_Intel
Employee

It appears that A and B are not 64-byte aligned, which _mm512_load_ps requires. Try _mm_malloc:

A=(float*)_mm_malloc(sizeof(float)*SIZE,64);

B=(float*)_mm_malloc(sizeof(float)*SIZE,64);
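
A quick way to confirm this diagnosis is to test a pointer's alignment before handing it to `_mm512_load_ps`. The sketch below is only an illustration under stated assumptions: `is_aligned` and `alloc_floats_64` are hypothetical names, and C11 `aligned_alloc` is used as a portable stand-in for `_mm_malloc` so that the snippet builds without Intel headers. Memory obtained from `_mm_malloc` must be released with `_mm_free`, whereas memory from `aligned_alloc` is released with `free`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p is a multiple of `alignment` bytes, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: allocate `count` floats on a 64-byte boundary.
   aligned_alloc expects a size that is a multiple of the alignment,
   so the byte count is rounded up to the next multiple of 64. */
float *alloc_floats_64(size_t count)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 63) / 64 * 64;
    float *p = (float *)aligned_alloc(64, rounded);
    assert(p == NULL || is_aligned(p, 64));
    return p;
}
```

Calling `is_aligned(A, 64)` on a pointer returned by plain `malloc` will often return 0, which matches the crash seen with the aligned load and store intrinsics.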

 

Feng_L_
Beginner

Thank you very much!

The problem was solved with your advice!
