Solved: offload_transfer: array of variables?

Sylvain_C_1 · ‎08-31-2015

Hello,

I would like to pre-allocate a number of buffers for later data transfers from CPU to MIC, using explicit offloading in C++.

It works nicely if each buffer corresponds to an explicit variable name, as e.g. in the double-buffering examples. However, I would like to have a configurable number of such buffers (more than 2), i.e. an array of buffers. The buffers are used for asynchronous processing on the MIC, and I need quite a few of them.

I do have a workaround: allocate a single very big buffer and cut it into pieces (using offsets and 'into' for transfers). But since the buffers do not need to be contiguous, I'm afraid this extra constraint may make it hard to find a large enough free block at runtime. So I would prefer to have several smaller buffers if possible.
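The offset arithmetic behind that workaround can be sketched as follows. This is only a host-side illustration: `Slice`, `slice_of` and `total_length` are hypothetical names that are not part of the offload extensions, and the sketch assumes all pieces have the same length.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the offload extensions): describes one
// piece of a single large allocation that is cut into equally sized slices.
struct Slice {
    std::size_t offset;  // index of the first element of the slice
    std::size_t length;  // number of elements in the slice
};

// Returns slice number `index` out of `count` slices of `slice_len` elements.
inline Slice slice_of(std::size_t index, std::size_t count, std::size_t slice_len) {
    assert(index < count);  // the slice must lie inside the big buffer
    return Slice{index * slice_len, slice_len};
}

// Number of elements the single big allocation has to provide.
inline std::size_t total_length(std::size_t count, std::size_t slice_len) {
    return count * slice_len;
}
```

The resulting offset and length would then go into the array section of the `into` clause, in the same `[start:length]` form used for `ptr1[0:size]` in the code of this post.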

The code at the end of this post illustrates the issue. In the first part, it works fine with 2 variable names. But in the second part, with an array, I can't find out how to proceed (or is it simply not possible?). I tried various syntaxes, but could not find one accepted by the compiler.

I would be glad if someone could help on this matter. Thanks in advance for any feedback on this!

cheers, Sylvain

#pragma offload_attribute (push,target(mic))
#include <stdio.h>
#pragma offload_attribute (pop)

#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

int main() {

  int size=100;      // size of buffer
  char input[size];  // buffer for input data on the CPU
 
  char *ptr1=NULL;  // reference to MIC buffer 1
  char *ptr2=NULL;  // reference to MIC buffer 2

  // pre-allocate MIC buffers
  #pragma offload_transfer target(mic:0) nocopy(ptr1 : length(size) ALLOC)
  #pragma offload_transfer target(mic:0) nocopy(ptr2 : length(size) ALLOC)

  // test use of buffer 1
  snprintf(input,size,"valPtr1");
  #pragma offload target(mic:0) in(input[0:size] : REUSE into(ptr1[0:size]))
  {
    printf("MIC: %p = %s\n",ptr1,ptr1);
  }

  // test use of buffer 2
  snprintf(input,size,"valPtr2");
  #pragma offload target(mic:0) in(input[0:size] : REUSE into(ptr2[0:size]))
  {
    printf("MIC: %p = %s\n",ptr2,ptr2);
  }


  // try to do same as above, but with an array instead of fixed variable names ptr1,ptr2
  // so that number of elements can be increased and iterated
  // e.g. instead of ptr1 and ptr2, use ptrX[0], ptrX[1] ... ptrX[i]
 
  // compiler does not seem to complain for the allocation
  // but it crashes at runtime
  char *ptrX[2]={NULL,NULL};
  for (int i=0;i<2;i++) {
    #pragma offload_transfer target(mic:0) nocopy(ptrX[i] : length(size) ALLOC)
  }

  // and then, how to use the buffers ???
  /*
  for (int i=0;i<2;i++) {
    snprintf(input,size,"valPtrX%d",i);
    #pragma offload target(mic:0) in(input[0:size] : REUSE into((???)[0:size]))
    {
      printf("MIC: %p = %s\n",???,???);
    }
  }
  */
  
  return 0;
}

Kevin_D_Intel · ‎08-31-2015

If you have not already seen the article Data transfer of an “array of pointers” using the Intel® Language Extensions for Offload (LEO) for the Intel® Xeon Phi™ coprocessor, I believe it offers a method to fit your interests. If not, then please let us know.

Sylvain_C_1 · ‎09-01-2015

Hello, and thanks for your fast feedback.

I managed to use array indirection as in the document you recommended (in particular, starting with the very last example in the ref manual https://software.intel.com/en-us/node/524507 describing the copy "into" with arrays).

with a call such as:

#pragma offload target(mic) in (p[0:1] : extent(0:DATA_ELEMS) into(q[ix:1]) into_extent(0:DATA_ELEMS))

p[0] points to my CPU input data buffer (p is an array of size 1)
ix is the index of the destination MIC buffer selected for this transfer (q is an array of size N)
each input and destination buffers are of size DATA_ELEMS

However, I am not sure how to declare and allocate the array q and corresponding destination buffers q[0]...q[N-1] on the MIC ONLY.
I tried a number of things but failed to get it working without q[] initialized also on the CPU.

To summarize what I'm looking to do:

1) once at init: pre-allocate N blocks of size S on the MIC only
2) iteratively at runtime: transfer data from one arbitrary CPU address (and length<=S) into one of these MIC buffers
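The two steps above can be modelled on the host alone to pin down the intended behaviour. This is a minimal sketch, not offload code: `BufferPool` and its member functions are hypothetical names, the blocks live in ordinary host memory, and an invalid index or an over-long transfer is assumed to be rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only stand-in for N pre-allocated blocks of S bytes each.
// Hypothetical class for illustration; it never touches the coprocessor.
class BufferPool {
public:
    // Step 1: allocate all blocks once, up front.
    BufferPool(std::size_t n_blocks, std::size_t block_size)
        : block_size_(block_size),
          blocks_(n_blocks, std::vector<char>(block_size, 0)) {}

    // Step 2: copy `len` bytes from an arbitrary source address into block `ix`.
    // Returns false and copies nothing if the index, length or source is invalid.
    bool copy_into(std::size_t ix, const void* src, std::size_t len) {
        if (ix >= blocks_.size() || len > block_size_ || src == nullptr) return false;
        std::memcpy(blocks_[ix].data(), src, len);
        return true;
    }

    // Read-only access to the contents of block `ix`.
    const char* block(std::size_t ix) const {
        assert(ix < blocks_.size());
        return blocks_[ix].data();
    }

    std::size_t count() const { return blocks_.size(); }

private:
    std::size_t block_size_;
    std::vector<std::vector<char>> blocks_;
};
```

In the real setting, the constructor corresponds to the one-time allocation on the MIC and `copy_into` to the per-iteration `in ... into` transfer.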

Please let me know if you have any suggestion.

best regards,
Sylvain

Rajiv_D_Intel · ‎09-01-2015

The targetptr modifier is available to declare MIC-only buffers. When you allocate q on the MIC, use the targetptr modifier. The existing values in q on the CPU are then ignored, MIC buffers are allocated for q, and the values in q on the CPU are updated with the addresses of the MIC buffers. From this point on, the q values should not be used directly on the CPU, but only through the offload pragmas.

To transfer data into or out of q, use the targetptr modifier. Similarly, use the targetptr modifier when freeing the MIC buffers once you are done with them.

Sylvain_C_1 · ‎09-02-2015

Hello,

Many thanks for the hint, this is exactly what I needed! I could only find documentation for the targetptr feature in the ICC 16.0 documentation, although it seems to work perfectly fine with my version 15.0.3. I used the description found at: https://software.intel.com/en-us/node/583639

And for reference, I paste below a full working example of what I was looking to achieve.

Best regards,

Sylvain

// this is a working example of arbitrary CPU pointer to MIC pre-allocated buffers copy

#pragma offload_attribute (push,target(mic))
#include <stdio.h>
#pragma offload_attribute (pop)

#include <stdlib.h>

#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

#define MIC_NBUF 5     // number of buffers on MIC
#define CPU_NBUF 3     // number of buffers on CPU
#define DATA_ELEMS 1000000     // number of items in each buffer (CPU and MIC)
#define ALIGN_COUNT (2*1024*1024)     // align boundary (parenthesized so the macro expands safely)


int main() {
 
  __declspec(target(mic)) short int *p[1];            // an array variable of size 1 for input pointer data indirection
  __declspec(target(mic)) short int *q[MIC_NBUF];     // an array to hold the MIC buffers
  __declspec(target(mic)) int ix=0;                   // index of current MIC buffer in use

  // create some input buffers on the CPU, filled with dummy data to be transferred
  short int *buf[CPU_NBUF]; // CPU buffers
  for (int i=0; i<CPU_NBUF; i++) {
    buf[i]=(short int *)_mm_malloc(sizeof(short int)*DATA_ELEMS,ALIGN_COUNT);
    for (int j=0;j<DATA_ELEMS;j++) {
      buf[i][j]=i*10+j%10;
    }
  }

  // we don't use q[] on the CPU, just fill it with NULL pointers
  for (int i=0; i<MIC_NBUF; i++) {
    q[i]=NULL;
  }
 
  // allocate q[0] q[1] ... q[MIC_NBUF-1] on the MIC ONLY (aligned)
   #pragma offload_transfer target(mic) nocopy (q[0:MIC_NBUF] : extent(0:DATA_ELEMS) ALLOC targetptr align(ALIGN_COUNT))

  // transfer from the CPU buffers to the MIC buffers round-robin
  for (int i=0;i<10;i++) {
    ix= i % MIC_NBUF;       // index of the MIC buffer to use as destination
    p[0]=buf[i%CPU_NBUF];   // pointer to the CPU buffer to use as source

    // copy  DATA_ELEMS * 'short int' data pointed by p[0] on the CPU to pre-allocated buffer pointed by q[ix] on the MIC
    #pragma offload target(mic) in (ix) nocopy(q) in (p[0:1] : extent(0:DATA_ELEMS) into(q[ix:1]) into_extent(0:DATA_ELEMS) REUSE targetptr )
     {
       printf("MIC ix=%d ptr=%p value=%d\n",ix,q[ix],(int)q[ix][0]);
        for (int j=0; j<MIC_NBUF; j++) {
           printf("q[%d][0]=%d\n",j,q[j][0]);
           printf("q[%d][1]=%d\n",j,q[j][1]);
           printf("q[%d][2]=%d\n",j,q[j][2]);
         }
     }
  }

  return 0;
}
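To check the printed values against what the round-robin loop should produce, the schedule can be replayed on the host. This is a hedged sketch: `expected_first_elements` is a hypothetical helper, and it assumes the fill pattern is such that element 0 of CPU buffer b holds b*10.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: replays the round-robin schedule on the host and
// returns, for each MIC buffer, the value expected in its element 0 after
// the last iteration (-1 means the buffer was never written).
// Assumes element 0 of CPU buffer b holds b*10.
inline std::vector<int> expected_first_elements(int iterations, int mic_nbuf, int cpu_nbuf) {
    assert(mic_nbuf > 0 && cpu_nbuf > 0);
    std::vector<int> first(mic_nbuf, -1);
    for (int i = 0; i < iterations; ++i) {
        int ix = i % mic_nbuf;   // destination MIC buffer for this iteration
        int src = i % cpu_nbuf;  // source CPU buffer for this iteration
        first[ix] = src * 10;    // a later iteration overwrites an earlier one
    }
    return first;
}
```

With 10 iterations, 5 MIC buffers and 3 CPU buffers as in the example, the final values of element 0 would be 20, 0, 10, 20 and 0 for MIC buffers 0 to 4. During the early iterations, buffers that have not been written yet may print arbitrary contents.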

Kevin_D_Intel · ‎09-03-2015

The functionality was made available in 15.0 for convenience of some early evaluation/testing before being officially announced in 16.0. Glad you found the solution you were looking for and thank you for sharing that for the benefit of others.

offload_transfer: array of variables?