Data persistence between offloads

Christopher_S_1 · ‎01-18-2013

Is it possible to have data persistence between offloads in different functions without using global variables? How do you let the MIC know to reuse the same memory? I know about using "nocopy" in the offload pragma, but I only seem to be able to get this to work when using the same variable name as in the previous offload (and not, for example, when using a pointer to the same memory in the host).

If my question needs more clarification, below is a simple stand-alone program that illustrates my question. It has three offloads. The intention is to initialize an array in the first offload, do some computation in the second offload without copying any data to or from the MIC, and read back the values (after additional modification) in the third offload. I want to time just the middle offload, in order to get just the time for computation without any data transfers to or from the card. Presumably I could add a chain of such operations before eventually transferring the data back to the host.

I am compiling it as follows:
icc -O2 -offload-build -offload-attribute-target=mic -openmp -vec-report3 -openmp-report -o micsimple simple.cpp

There are two #define's at the top of the program to illustrate three cases:
1) When none of the #defines are uncommented, all three offloads are in the main function, and the middle offload takes about 0.2 seconds.
2) When just the #define FUNCTION_CALL is uncommented, the middle offload occurs in another function, with the "nocopy" clause, and it results in an error: "offload error: process on the device 0 was terminated by signal 11".
3) When FUNCTION_CALL and GLOBAL_ARRAY are uncommented, the middle offload occurs in another function, but it uses the same global variable for the array as the main function (rather than a pointer passed to it as a parameter). This has no error but the middle offload is much slower than Case 1 (2.3 seconds).

It seems a little awkward to have to use global variables (or confine all offloads the same function) in order to let the MIC know to reuse memory from a previous offload. I am also unclear as to why the offload is slower in another function even when the global variable is used. Of course, I may be doing something wrong, or there may be a way around this that I have not realized.

Thanks!

#define SIZE 1000000000

//#define FUNCTION_CALL
//#define GLOBAL_ARRAY

#include <math.h>
#include <stdlib.h>
#include <iostream>
#include <sys/time.h>
#include <stdio.h>

#ifdef GLOBAL_ARRAY
__declspec (target(mic)) float* array;
#endif

void myFunction(
#ifndef GLOBAL_ARRAY
float* array,
#endif
int nthreads)
{
#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))
#pragma omp parallel num_threads(nthreads)
{
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array = 2.0f*array;
}
}

int main(int argc, char* argv[])
{
struct timeval begin, end, diff;
int nthreads = 16;

#ifndef GLOBAL_ARRAY
static __declspec (target(mic)) float* array;
#endif
array = (float*)malloc(SIZE*sizeof(float));

#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(1) free_if(0))
#pragma omp parallel num_threads(nthreads)
{
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array = 1.0f;
}

gettimeofday(&begin, 0);

#ifdef FUNCTION_CALL
myFunction(
#ifndef GLOBAL_ARRAY
    array,
#endif
    nthreads);
#else
#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))
#pragma omp parallel num_threads(nthreads)
{
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array = 2.0f*array;
}
#endif

gettimeofday(&end, 0);
timersub(&end, &begin, &diff);
float seconds = diff.tv_sec + 1.0E-6*diff.tv_usec;
printf("Seconds: %f\n", seconds);

#pragma offload target(mic) out(array : length(SIZE) alloc_if(0) free_if(1))
#pragma omp parallel num_threads(nthreads)
{
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array = 2.0f*array;
}

for (unsigned int i=0; i<16; i++)
    printf("%f ", array[rand() % SIZE]);
printf("\n");

return 0;
}

Andrey_Vladimirov · ‎01-18-2013

Hi Christopher,

I was also recently trying to implement data persistence on the coprocessor in an offload from a function. The only method that I was successful with is declaring a static array in the scope of the function to hold the data, and marking it with "__attribute__((target(mic)))" (or __declspec(target(mic)) ). It actually makes sense that "nocopy" fails for non-static and non-global pointers: the variable that holds persistent data must survive between multiple function calls. So static or global variables seem like the only way to go. Here is a (dirty) code that illustrates the approach that I ended up with:

#include <stdio.h>
#include <cstring>

void foo(const char* data, const int n) {

static const char* persistentData __attribute__((target(mic))) = NULL;

if (data != persistentData) {
printf("Offloading data...\n");
persistentData = data;
#pragma offload_transfer target(mic:0) in(persistentData : length(n) alloc_if(1) free_if(0))
}

#pragma offload target(mic:0) nocopy(persistentData : length(n) alloc_if(0) free_if(0))
{
printf("Re-using data...\n");
for (int i = 0; i < n; i++)
printf("%c", persistentData);
fflush(0);
}
}

int main() {
char d[50] = "Hello World!\n";
const int n = std::strlen(d);
foo(d, n);
foo(d, n);
strcpy(d, "Goodbye cruel world!\n"); // Will NOT be printed
foo(d, n);
}

Result:

$ icpc foo.cc
$ ./a.out
Offloading data...
Re-using data...
Hello World!
Re-using data...
Hello World!
Re-using data...
Hello World!
$

Christopher_S_1 · ‎01-22-2013

Thanks; that is very helpful!

Chris

Christopher_S_1 · ‎01-23-2013

Having worked with this a little more, I have a follow-up question.

Your code works well for what it was intended to do: allow you to call the same function consecutively with the same input data, reusing the MIC memory. However, it still seems subject to the constraint that MIC memory can only be reused when the offload pragma is referenced using the same literal variable name. Is this an unavoidable constraint? Is there no way to refer to an address in the MIC memory?

What if I wanted to call your function with several different input arrays, multiple times each, interspersed with each other? What if I wanted to allocate the MIC memory in one function and then reuse it in another?

Below is another simple program to illustrate my question another way. If I have "#define USE_ARRAY array1", it works correctly. However, if I have "#define USE_ARRAY array2", it crashes on the offload, even though array1 and array2 are set to be exactly the same thing (at least in host memory). Thus, it seems that MIC memory is only associated with a variable name, rather than with a variable value. I understand what you were saying about how the variable needs to be global or static to maintain persistence (as opposed to a stack variable in a function), but I don't see why it does not seem possible to refer to MIC memory allocated using the name of one global variable with another global variable. I guess there must be a good reason for this, but it seems to make designing clean, efficient code almost impossible.

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_ARRAY array1
//#define USE_ARRAY array1 // works correctly
#define USE_ARRAY array2 // crashes on offload

int* array1;
int* array2;

int main(int argc, char* argv[])
{
int n = 10;

array1 = new int;
array2 = array1;

#pragma offload_transfer target(mic:0) nocopy(ALLOC_ARRAY : length(n) alloc_if(1) free_if(0))

#pragma offload target(mic:0) nocopy(USE_ARRAY : length(n) alloc_if(0) free_if(0))
{
for (unsigned int i=0; i<n; i++) USE_ARRAY = 5;
}

#pragma offload_transfer target(mic:0) out(USE_ARRAY : length(n) alloc_if(0) free_if(1))

for (unsigned int i=0; i<n; i++)
printf("%d ", USE_ARRAY);
printf("\n");

return 0;
}

Christopher_S_1 · ‎01-23-2013

I found this from the FAQ at the top of the forum:

http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features

Based on the "Local Pointers Versus Pointers Used Across Offloads" section, I got the modified version of the code from my previous post to work by changing the offload that reuses previously allocated data to be an in with length 0 rather than a nocopy of length n. I don't fully understand this yet, but maybe I am getting on the right track. If I comment out the line "array2 = array1;", it crashes, so apparently maybe there is a connection between the value of a pointer and the memory with which it is associated on the MIC.

Anyway, I will continue reading that link and experimenting, but if anyone thinks they can clear up my confusion more quickly, please post.

Thanks!

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_ARRAY array1
#define USE_ARRAY array2

int* array1;
int* array2;

int main(int argc, char* argv[])
{
int n = 10;

array1 = new int;
array2 = array1;

#pragma offload_transfer target(mic:0) nocopy(ALLOC_ARRAY : length(n) alloc_if(1) free_if(0))

#pragma offload target(mic:0) in(USE_ARRAY : length(0) alloc_if(0) free_if(0))
{
for (unsigned int i=0; i<n; i++) USE_ARRAY = 5;
}

#pragma offload_transfer target(mic:0) out(USE_ARRAY : length(n) alloc_if(0) free_if(1))

for (unsigned int i=0; i<n; i++)
printf("%d ", USE_ARRAY);
printf("\n");

return 0;
}

Sumedh_N_Intel · ‎01-24-2013

Hi Chris,

I am sorry for all this confusion. The host pointer value is used as a key in a table of host-coprocessor pointer data associations. The association is not by name, but by pointer value. If you had two pointer variables on the host with the same value, then, yes, they will be mapped to the same allocated memory on the coprocessor.

As for your previous code, it did not work because of this statement:

#pragma offload target(mic:0) nocopy(USE_ARRAY : length(n) alloc_if(0) free_if(0))

This statement simply created the pointer 'array2' on the coprocessor but did not update it's value to point the previously allocated array. By using in with length(0), you told the compiler to create the pointer 'array2' but also to update (refresh) the pointer value to point to the previously allocated memory address. That is why your second code worked.

Hopefully the following code will further clarify the ultility of in with length(0):

void f()

{

     int * p = malloc(…);

     …

     // 100 elements are transferred to MIC, from p on the CPU

// In subsequent offloads the MIC memory for the transferred data can be located on the CPU using the value of p

     #pragma offload_transfer in(p[0:100] alloc_if(1) free_if(0))

     g(p);

}

void g(int *q)

{

     // This pragma does not do data transfer; previously sent data is reused

     // That’s why the length is 0, meaning don’t send data

     // However, we used the “in” clause instead of “nocopy”, to update the MIC pointer q with the previously allocated memory address

     // Because p (at the time of MIC memory allocation) and q in this offload have the same value, the MIC memory is located

     #pragma offload … in(q[0:0])

     {

           int x = q[0] + q[99];

}

}

I hope this clarify things.

-Sumedh

Kevin_D_Intel · ‎01-25-2013

To add to what Sumedh's said, the program (in the first post) works using the FUNCTION method when one creates the instance of the pointer ARRAY in the function using IN with length(0).

Instead of this:
#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))

use this:
#pragma offload target(mic) in(array : length(0) alloc_if(0) free_if(0))

Andrey_Vladimirov · ‎01-25-2013

Chris, in my code I had to call the function with multiple arrays, all of which had to be persistent. In order to pull this off, I had a single pointer to hold all my data on the card in a single big chunk. I did memory management within that big chunk of data "manually". That is, when I had to use one of the data sets, I used a local pointer in the MIC code to point to the respective section of the big array.

Thanks, Sumedh and Kevin! It is good to know that this task can be done without the acrobatics that I described.

Christopher_S_1 · ‎01-28-2013

Thank you very much Sumedh, Kevin, and Andrey! That was definitely helpful.

Ravishankar_S_ · ‎02-14-2013

I'm trying to add MIC support to a large DFT code that does a lot of memory management between main memory and GRAM (currently in CUDA). I couldn't glean from the documentation so far whether this was possible to do in an organized manner in offload mode - I was almost considering doing it over MPI with native mode instead. This thread has been particularly useful for me to figure out how to achieve this correctly - thanks to all involved!

It would be very helpful if a simple example of memory management (without global or static pointers) based on the examples discussed above was added to the "Managing Memory Allocation for pointer variables" page of the MIC documentation (http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/GUID-82634DA6-2DC9-4552-A77D-381E344D3029.htm).

Thanks again!

Kevin_D_Intel · ‎02-14-2013

Thank you for this feedback. I appreciate you taking time to post. Expanding/improving the topic noted is planned for a future release so we will consider your feedback with that. Thank you again.

Ravi_N_Intel · ‎02-14-2013

-offload-build option has been deprecated. You don't need to use the option now since the compiler automatically detects offload constructs and generates the necessary MIC code.

Marco_Comparato · ‎05-31-2013

Hi,

I don't understand if a code like the following is possible or not.

Thank you in advance,

Marco

[cpp]

#include <stdlib.h>

void device_malloc(int *t);
void device_free(int *t);

int main()
{
int *p;

p = (int *)malloc(3);

device_malloc(p);
device_free(p);

free(p);
}

void device_malloc(int *t)
{
#pragma offload target(mic:0) <FILL PLEASE>
{
t = (int *)malloc(3);
}
}

void device_free(int *t)
{
#pragma offload target(mic:0) <FILL PLEASE>
{
free(t);
}
}

[/cpp]

TaylorIoTKidd · ‎08-29-2013

Unfortunately, your post is one of the unfortunate that was missed.

The code is possible assuming you specify that the offload runtime transfers the data pointed to by 't', but it is incorrect in a few ways.

(int *)malloc(3) is an unusual length
You need to specify something like inout(t:length(x)) but length(x) doesn't make much sense with respect to your malloc(3)
t=(int*)malloc(3) creates a memory leak on the card since the heaps on the host and the coprocessor are separate and unique

I believe what you want to do is to malloc and persist a specified amount of data on the card. To do that, first malloc the data on the host, and then pass the pointer and the amount of data to the offload statement. For example,

int *t;
t = (int*) malloc((sizeof int)*N);
#pragma offload target(mic:0) in(t:length(N) alloc_if(1) free_if(0))
{...}
#pragma offload target(mic:0) nocopy(t:length(N) alloc_if(0) free_if(0))
{...}
#pragma offload target(mic:0) out(t:length(N) alloc_if(0) free_if(1))
{...}

Outside of minor syntax errors, this will (1) allocate an array on the host, (2) allocate that same amount on the coprocessor and copy the data over, (3) persist the data to the next offload. You can find examples of this under the Samples directory for the C++ compiler.

Regards
--
Taylor

Sumedh_N_Intel · ‎08-29-2013

Hi Marco,

You might find the following threading helpful:

http://software.intel.com/en-us/forums/topic/382988#comment-1731861

Amir_G_ · ‎01-10-2014

What Taylor suggests seems to be only working when the offloads are being called inside one function. If the offloads are being called from different C function a segmentation error is returned. I experimented this with the following code:

[cpp]

int main( int argc, char* argv[])
{

const int N = 1000;
double * a,*b,*c;

a=(double*) malloc(sizeof(double)*N*N);
b=(double*) malloc(sizeof(double)*N*N);
c=(double*) malloc(sizeof(double)*N*N);

// Initialize Data
for ( int i = 0; i < N; ++i )
for(int j=0; j<N; ++j){
a[j+i*N]=i+j;
b[j+i*N]=i+j;
c[j+i*N]=0.0;
}

// Call alloc_mic to just initialize and transfer a, b and c to coprocessor

alloc_mic(a,b,c,N);

// call MMult_mic to compute Matrix Matrix Multiplication c=a*b, with a,b, and c already in the PHI memory
MMult_mic( a,b,c,N );

return 0;
}

void alloc_mic( double * a, double*b,double *c, int N)
{

#pragma offload_transfer target(mic:0)\
in(a:length(N*N) alloc_if(1) free_if(0))\
in(b:length(N*N) alloc_if(1) free_if(0))\
in(c:length(N*N) alloc_if(1) free_if(0))
{
int nth=omp_get_max_threads();
printf("MIC Max Threads %f\n",1.0*nth);
}

}
void MMult_mic( double * a, double*b,double *c, int N )
{

#pragma offload target(mic:0) \
nocopy(a:length(N*N) alloc_if(0) free_if(0))\
nocopy(b:length(N*N) alloc_if(0) free_if(0))\
out(c:length(N*N) alloc_if(0) free_if(0))
{

#pragma omp parallel for
for ( int i = 0; i < N; ++i )
for(int k=0; k<N; ++k)
for(int j=0; j<N; ++j)
c[j+i*N]+=a[k+i*N]*b[j+k*N];
}
}

[/cpp]

This returns a segmentation fault. However, if I had the second offload (in MMult_mic) done inside alloc_mic function (after its offload is done) there would be no problem. So the question is what is the problem? I have included the source file as an attachment.

Dhairya_M_ · ‎01-13-2014

Kevin, what you said solved my problem. But the syntax is counter intuitive:

#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))

#pragma offload target(mic) in(array : length(0) alloc_if(0) free_if(0))

In the first case, when we specify alloc_if(0) free_if(0), why will the compiler assume that we are resizing (when we explicitly say no-alloc). In fact the second syntax suggests that we want the compiler to copy the array from CPU to Phi using the previously allocated size (since we dont specify a size but we do say in(...) ).

Intel please fix this messed up notation!! Its been over a year and there are still many such issues!!

Dhairya_M_ · ‎01-14-2014

Intel should at least fix the documentation, because the documentation uses "nocopy(array)" instead of the code that works.

Kevin_D_Intel · ‎01-16-2014

@Amir - your code should use IN with length(0) vs. NOCOPY in MMult_MIC, e.g.

Instead of:
nocopy(a:length(N*N) alloc_if(0) free_if(0))\
nocopy(b:length(N*N) alloc_if(0) free_if(0))\

Use:
in(a:length(0) alloc_if(0) free_if(0))\
in(b:length(0) alloc_if(0) free_if(0))\

If you have not already done so, refer to the FAQ cited in post #5 above and related blog post: Behind the Scenes: Offload Memory Management on the Intel® Xeon Phi™ coprocessor

@Dhairya - I posted in your other thread here: http://software.intel.com/forums/topic/499257. We apologize for the incorrect examples in the 13.1 documentation. I believe we have corrected those that appeared in the 13.1 C++ User Guide in our 14.0 (Composer XE 2013 SP1) release. Is it possible you might have referenced the 13.1 version of the guide?

Amir_G_ · ‎01-16-2014

@Kevin:

I would highly appreciate it if you would clarify when/where we should use nocopy? According to Taylor's post above nocopy should have worked but it is not working. So are you suggesting that in Taylor's code above one should have used in with length(0)? Then what is the point of this nocopy clause?

I suspect that there is a bug in intel's nocopy implementation. The reason I am saying this, is that if two pragmas are in the same function nocopy works but when the second pragma is called, it fails to work.

Ravi_N_Intel · ‎01-16-2014

nocopy means nothing is sent or recieved.

in(....length(0)...) means the pointer for the memory is sent

Since your pointer is on the host stack (passed as argument to the function MMult_mic) you need to update the corresponding stack pointer variable on the card with memory which was allocated on the host in an earlier pragma directive.