Manuel_D_1 · ‎05-22-2016

Actually I want to offload a function call in a for-loop that I run in parallel with OpenMP. The problem is that I can either run this loop in parallel or offload the function call, but not both at the same time. Every function call fills two arrays, which are the only out clauses of the offload.

To run the loop in parallel I have two arrays of pointers to (int) output arrays, so that every parallel offload can save its output in its own array, following this page: https://software.intel.com/en-us/articles/xeon-phi-coprocessor-data-transfer-array-of-pointers-using-language-extensions-for-offload

To give a better overview, here is the important part of the code, which runs fine if I delete the offload pragma and run bar on the host:

foo() {
int *outa[threadnum], *outb[threadnum];
... // calc arrsize
   // here I want to insert #pragma omp for schedule...
   for () {
    outa[x] = (int *) malloc (arrsize * sizeof(int)); // x always between 0 and threadnum-1
    outb[x] = (int *) malloc (arrsize * sizeof(int));
    #pragma offload target(mic) \
    in(...) \
    out( outa[x] : length(arrsize) ) \
    out( outb[x] : length(arrsize) )
    bar (...,outa[x],outb[x],...);
    ...
    free(outa[x]);
    free(outb[x]);
   }
}

Is there any obvious problem that I have missed and that leads to a segmentation fault (it happens in the very first offload, with x=0)? For comparison, here is the part of the code when I run the for-loop serially with the offload to the Phi (working fine):

foo() {
int *outa, *outb;

   for () {
   outa = (int *) malloc(arrsize * sizeof(int));
   outb = (int *) malloc(arrsize * sizeof(int));
   #pragma offload target(mic) \
   in(...) \
   out(outa:length(arrsize)) \
   out(outb:length(arrsize))
   bar(...outa,outb,...);
   ...
   free(outa);
   free(outb);
   }
}

I appreciate every comment. If necessary, I can try to create a code snippet that reproduces the error.

Gregg_S_Intel · ‎05-23-2016

You may have more luck expressing this as follows. But even better would be to do the threading on the card.

foo() {
#pragma omp for
for () {
  int *outa = (int *) malloc (arrsize * sizeof(int));
  int *outb = (int *) malloc (arrsize * sizeof(int));
  #pragma offload target(mic) \
  in(...) \
  out( outa : length(arrsize) ) \
  out( outb : length(arrsize) )
  bar (...,outa,outb,...);
  ...
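
For readers who want something they can compile, here is a minimal sketch of the pattern above, not the original poster's code. The body of bar, the checksum, and the function signature are hypothetical stand-ins. It uses `#pragma omp parallel for` so that it is complete on its own, and it assumes the Intel compiler defines `__INTEL_OFFLOAD` when offload is enabled; with any other compiler the offload pragma is skipped and bar simply runs on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for bar(): fills both output arrays. */
#ifdef __INTEL_OFFLOAD
__attribute__((target(mic)))
#endif
void bar(int iter, int *outa, int *outb, int arrsize)
{
    for (int i = 0; i < arrsize; ++i) {
        outa[i] = iter + i;
        outb[i] = iter - i;
    }
}

/* Runs one offload per loop iteration from a host-side parallel loop.
   Returns a checksum of all outputs, or -1 if an allocation failed. */
long foo(int iters, int arrsize)
{
    long sum = 0;
    int failed = 0;
    assert(iters >= 0 && arrsize > 0);

    #pragma omp parallel for reduction(+:sum) reduction(|:failed)
    for (int it = 0; it < iters; ++it) {
        /* Declared inside the loop body, so every iteration (and thus
           every thread) works with its own pair of pointers. */
        int *outa = malloc(arrsize * sizeof *outa);
        int *outb = malloc(arrsize * sizeof *outb);
        if (outa && outb) {
            /* Assumption: the scalars it and arrsize are copied to the
               card by default, so no in() clause is written here. */
#ifdef __INTEL_OFFLOAD
            #pragma offload target(mic) \
                out(outa : length(arrsize)) \
                out(outb : length(arrsize))
#endif
            bar(it, outa, outb, arrsize);
            for (int i = 0; i < arrsize; ++i)
                sum += outa[i] + outb[i];
        } else {
            failed = 1;
        }
        free(outa);
        free(outb);
    }
    return failed ? -1 : sum;
}
```

Because the pointers are locals of the loop body, no array of pointers and no per-thread index is needed.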


Kevin_D_Intel · ‎05-23-2016

I'm not sure what might be the issue. Maybe alignment. I can inquire with our Developers.

Am I understanding correctly that the offload w/array of pointers does not yet have the omp enabled (based on code comment about where you want to add the omp pragma)? If omp is active with the offload w/array of pointers, can it be run with a single thread?

Do you have multiple phi cards?

Kevin_D_Intel · ‎05-23-2016

It would help to have a reproducer to investigate and knowing your compiler version (icc -V). Thank you.

Ravi_N_Intel · ‎05-23-2016

You cannot have a parallel loop enclosing a pragma offload that allocates/transfers the same variable without any synchronization using signal/wait.
If the first thread in the parallel loop is still allocating the memory and transferring the data for the offload pragma, the second thread might assume the data is ready and start executing the offload.
One way to avoid this is to allocate/transfer the data before the parallel loop and transfer/deallocate after the parallel loop.

e.g.:
#pragma offload_transfer target(mic:0) nocopy(outa : length(arrsize) alloc_if(1) free_if(0))
#pragma parallel loop
#pragma offload target

#pragma offload_transfer target(mic:0) out(data : length(size) alloc_if(0) free_if(1))
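
A fuller sketch of the allocate-once, transfer-at-the-end pattern, shown with a serial loop to keep it simple. The function name accumulate, its arguments, and the arithmetic are made up for illustration. It assumes `__INTEL_OFFLOAD` is defined when the Intel compiler builds with offload enabled; otherwise the pragmas are skipped and everything runs on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical example: adds i to data[i] once per step while the buffer
   stays resident on the card between offloads. Returns 0 on success. */
int accumulate(int *result, int n, int steps)
{
    assert(n > 0 && steps > 0);
    int *data = malloc(n * sizeof *data);
    if (!data)
        return -1;

#ifdef __INTEL_OFFLOAD
    /* Allocate the card-side buffer once; nothing is transferred yet. */
    #pragma offload_transfer target(mic:0) \
        nocopy(data : length(n) alloc_if(1) free_if(0))
#endif

    for (int step = 0; step < steps; ++step) {
#ifdef __INTEL_OFFLOAD
        /* Reuse the existing card-side buffer: no allocation, no copy. */
        #pragma offload target(mic:0) \
            nocopy(data : length(n) alloc_if(0) free_if(0))
#endif
        {
            /* The first step initializes the buffer, later steps add to it. */
            for (int i = 0; i < n; ++i)
                data[i] = (step == 0 ? 0 : data[i]) + i;
        }
    }

#ifdef __INTEL_OFFLOAD
    /* Copy the final contents back and release the card-side buffer. */
    #pragma offload_transfer target(mic:0) \
        out(data : length(n) alloc_if(0) free_if(1))
#endif

    for (int i = 0; i < n; ++i)
        result[i] = data[i];
    free(data);
    return 0;
}
```

The buffer crosses the bus only once, at the end, instead of once per offload.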

Gregg_S_Intel · ‎05-24-2016

Ravi, wouldn't the threaded loop be doing multiple offloads, each with its own local outa/outb arrays? (Not that I think it is a good idea...)
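
The alternative suggested earlier in the thread, doing the threading on the card, could look roughly like the sketch below: one offload, with the OpenMP loop inside it. The name foo_on_card, the flat row-per-iteration layout, and the body of bar are assumptions made for illustration, and `__INTEL_OFFLOAD` is assumed to be defined only when the Intel compiler builds with offload enabled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for bar(): fills one row of each output array. */
#ifdef __INTEL_OFFLOAD
__attribute__((target(mic)))
#endif
void bar(int iter, int *outa, int *outb, int arrsize)
{
    for (int i = 0; i < arrsize; ++i) {
        outa[i] = iter + i;
        outb[i] = iter - i;
    }
}

/* outa and outb each hold iters * arrsize ints, one row per iteration. */
void foo_on_card(int *outa, int *outb, int iters, int arrsize)
{
    int total = iters * arrsize;
    assert(iters > 0 && arrsize > 0);

    /* A single offload; the parallel loop runs inside it on the card. */
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic) \
        out(outa : length(total)) \
        out(outb : length(total))
#endif
    {
        #pragma omp parallel for
        for (int it = 0; it < iters; ++it)
            bar(it, outa + it * arrsize, outb + it * arrsize, arrsize);
    }
}
```

This replaces many small offloads, each with its own transfer, with a single transfer of both result arrays.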

Ravi_N_Intel · ‎05-24-2016

You are right. I missed that each thread got its own copy.

Manuel_D_1 · ‎06-01-2016

First of all, sorry for the late reply.

@Kevin D:
Yes, we have multiple cards, but using only one by setting the target to mic:0 does not help, if that is what you were thinking about.
The first code only runs fine (even in parallel) if I delete the offload pragma; when I try to offload, it crashes, with and without the omp parallel pragma.

Anyhow, thanks (a lot) to Gregg S.'s example the code now runs fine. If I have some free time after finishing the optimization, I will look at this part again to see if I can find another way to get it to work and post it here.

Offloading an Array of pointer, SIGSEGV