Christopher_S_1 · 01-18-2013

Is it possible to have data persistence between offloads in different functions without using global variables? How do you let the MIC know to reuse the same memory? I know about using "nocopy" in the offload pragma, but I only seem to be able to get this to work when using the same variable name as in the previous offload (and not, for example, when using a different pointer to the same memory on the host).

If my question needs more clarification, below is a simple stand-alone program that illustrates my question. It has three offloads. The intention is to initialize an array in the first offload, do some computation in the second offload without copying any data to or from the MIC, and read back the values (after additional modification) in the third offload. I want to time just the middle offload, in order to get just the time for computation without any data transfers to or from the card. Presumably I could add a chain of such operations before eventually transferring the data back to the host.

I am compiling it as follows:
icc -O2 -offload-build -offload-attribute-target=mic -openmp -vec-report3 -openmp-report -o micsimple simple.cpp

There are two #define's at the top of the program to illustrate three cases:
1) When none of the #defines are uncommented, all three offloads are in the main function, and the middle offload takes about 0.2 seconds.
2) When just the #define FUNCTION_CALL is uncommented, the middle offload occurs in another function, with the "nocopy" clause, and it results in an error: "offload error: process on the device 0 was terminated by signal 11".
3) When FUNCTION_CALL and GLOBAL_ARRAY are uncommented, the middle offload occurs in another function, but it uses the same global variable for the array as the main function (rather than a pointer passed to it as a parameter). This has no error but the middle offload is much slower than Case 1 (2.3 seconds).

It seems a little awkward to have to use global variables (or confine all offloads to the same function) in order to let the MIC know to reuse memory from a previous offload. I am also unclear as to why the offload is slower in another function even when the global variable is used. Of course, I may be doing something wrong, or there may be a way around this that I have not realized.

Thanks!

#define SIZE 1000000000

//#define FUNCTION_CALL
//#define GLOBAL_ARRAY

#include <math.h>
#include <stdlib.h>
#include <iostream>
#include <sys/time.h>
#include <stdio.h>

#ifdef GLOBAL_ARRAY
__declspec (target(mic)) float* array;
#endif

void myFunction(
#ifndef GLOBAL_ARRAY
float* array,
#endif
int nthreads)
{
#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))
#pragma omp parallel num_threads(nthreads)
{
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array[i] = 2.0f*array[i];
}
}

int main(int argc, char* argv[])
{
struct timeval begin, end, diff;
int nthreads = 16;

#ifndef GLOBAL_ARRAY
static __declspec (target(mic)) float* array;
#endif
array = (float*)malloc(SIZE*sizeof(float));

#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(1) free_if(0))
#pragma omp parallel num_threads(nthreads)
{
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array[i] = 1.0f;
}

gettimeofday(&begin, 0);

#ifdef FUNCTION_CALL
myFunction(
#ifndef GLOBAL_ARRAY
    array,
#endif
    nthreads);
#else
#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))
#pragma omp parallel num_threads(nthreads)
{
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array[i] = 2.0f*array[i];
}
#endif

gettimeofday(&end, 0);
timersub(&end, &begin, &diff);
float seconds = diff.tv_sec + 1.0E-6*diff.tv_usec;
printf("Seconds: %f\n", seconds);

#pragma offload target(mic) out(array : length(SIZE) alloc_if(0) free_if(1))
#pragma omp parallel num_threads(nthreads)
{
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array[i] = 2.0f*array[i];
}

for (unsigned int i=0; i<16; i++)
    printf("%f ", array[rand() % SIZE]);
printf("\n");

return 0;
}

Kevin_D_Intel · 01-17-2014

In Taylor’s post, nocopy is correct because all of his pragmas appear within the same local scope. The first pragma, with in(), updates the pointer on the target, allocates memory, and transfers the current data, so the second, third, and any subsequent pragmas within that scope need not update the target’s copy of the pointer. He can therefore use nocopy as he did in the second pragma, which was simply intended to reuse the data transferred to the card by the previous pragma. His nocopy does not require length(); in his context (with alloc_if(0)) it is ignored.

With respect to what Marco was asking, the key restriction to explicit allocation within offloaded code is that memory allocated by the user using malloc or some such API cannot participate in the data transfer pragmas. For the pragmas to be usable, the allocation must be done using the pragmas also. In other words, Marco would have to transfer the values of “p” to the card and memcpy into “t” himself. I discussed this recently related to another user’s post here: http://software.intel.com/en-us/forums/topic/499631

Your example uses pointers in different scopes. As Ravi indicated, when you change scope you must update the new scope’s corresponding stack pointer variable on the target, so the first reference to the pointer in the new scope must not be a nocopy. It must be an in() with alloc_if(0) and a length that may be either zero or non-zero. Within that same scope, after this in() pragma, you would then use nocopy as in Taylor’s code.
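To make the rule concrete, here is a minimal sketch of the function-call case, assuming the behavior described above: the helper's first offload uses in() with length(0) and alloc_if(0), which is intended to refresh the target's copy of the local pointer without moving any array data. The names scaleOnCard and runOnce are hypothetical and not from the original program, and the array is kept small. The pragmas are guarded by __INTEL_OFFLOAD, so a compiler without offload support simply runs the loops on the host.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: doubles n floats that are assumed to be resident
// on the card already (allocated by an earlier offload in the caller).
void scaleOnCard(float* array, int n)
{
    assert(array != NULL);
    // New scope: the first reference must be in(), not nocopy.
    // length(0) with alloc_if(0) is meant to refresh the pointer only.
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic) in(array : length(0) alloc_if(0) free_if(0))
#endif
    {
        for (int i = 0; i < n; i++)
            array[i] = 2.0f * array[i];
    }
}

// Hypothetical driver: allocate and fill on the card, call the helper,
// then copy the result back and free the card memory.
float runOnce(int n)
{
    float* array = (float*)malloc(n * sizeof(float));
    assert(array != NULL);
    for (int i = 0; i < n; i++)
        array[i] = 1.0f;

    // Allocate on the card and transfer the initial values.
#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic) in(array : length(n) alloc_if(1) free_if(0))
#endif

    scaleOnCard(array, n);

    // Read the data back and release the card allocation.
#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic) out(array : length(n) alloc_if(0) free_if(1))
#endif

    float result = array[n - 1];
    free(array);
    return result;
}
```

Calling runOnce(1000) should return 2.0f whether the loop runs on the card or on the host. Timing only the call to scaleOnCard would then measure the computation plus the pointer refresh, not a full array transfer.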

So, the use of nocopy depends on the scope and on what is needed in terms of allocation, no data transfer, and/or pointer refresh. The FAQ in post #5 contains a discussion of Local Pointers Versus Pointers Used Across Offloads.

Rather than duplicate a portion of another recent related post, see if the reply here (http://software.intel.com/en-us/forums/topic/499257#comment-1776963) helps further explain this matter.

As a final note, all the focus/discussion here relates to local pointers. For global pointers, once created/updated, there is no need for in() with length(0) when they are used within different functions, since the value is retained in global (static) memory on the target.

Dhairya_M_ · 01-18-2014

So Ravi, what you are saying is that for array we should ALWAYS use in(...) because the CPU pointer will always be sent from CPU to MIC and never the other way. We can use in(...) even when in the same scope because the CPU pointer value is the same.

So what do we do when we need to transfer data from a persistent MIC array back to the CPU? From what you said, out(...) will just overwrite the CPU pointer variable. Or should we first do in(A: length(0) REUSE) and then out(A: length(N) REUSE)?

Ravi_N_Intel · 01-20-2014

You can always use in(..) in the same scope, but it increases the amount of data transferred by 8 bytes, which could have been avoided.

The host pointer is the master pointer and will not be overwritten. An out(..) can only copy data from the MIC into the memory that the host pointer points to.
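As a rough illustration of that point, the sketch below reads a persistent array back with out() and checks that the host pointer value is unchanged afterwards. It assumes that ALLOC, REUSE and FREE expand to the usual alloc_if/free_if pairs (the thread does not show their definitions), and the helper name readBackKeepsHostPointer is hypothetical. Everything stays in one scope, so no pointer refresh is involved. The pragmas are guarded by __INTEL_OFFLOAD, so without offload support the code just runs on the host.

```cpp
#include <cassert>
#include <cstdlib>

// Assumed expansions of the shorthand used in the question above.
#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)

// Hypothetical helper: returns true if, after reading a persistent card
// array back with out(), the host pointer still has its original value
// and the host buffer holds the values computed on the card.
bool readBackKeepsHostPointer(int n)
{
    float* A = (float*)malloc(n * sizeof(float));
    assert(A != NULL);
    for (int i = 0; i < n; i++)
        A[i] = 1.0f;
    float* const before = A;   // remember the host pointer value

    // Allocate on the card, send the data, and modify the card copy.
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic) in(A : length(n) ALLOC)
#endif
    {
        for (int i = 0; i < n; i++)
            A[i] += 1.0f;
    }

    // Copy the card data into the memory the host pointer points to.
#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic) out(A : length(n) REUSE)
#endif

    bool ok = (A == before) && (A[0] == 2.0f) && (A[n - 1] == 2.0f);

    // Release the card allocation without transferring anything.
#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic) nocopy(A : length(n) FREE)
#endif

    free(A);
    return ok;
}
```

If the readback happened in a different function, the earlier advice about the first reference in a new scope would presumably apply there as well. This sketch does not cover that case.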

King_Crimson · 07-30-2015

IMHO, all the complexities and inelegance would be gone if a high-level version of COI (without pragma) could be made by Intel.

Data persistence between offloads