Xeon-Phi asynchronous offload from host openMP parallel region

piyush_s_ · ‎04-24-2014

I am using Intel's offload pragmas in host OpenMP code. The code looks as follows:

   int s1 = f(a,b,c);

   #pragma offload signal(s1) in(...) out(x:len)
   {
       for (int i = 0; i < len; ++i)
       {
           x[i] = ...
       }
   }

   #pragma omp parallel default(shared)
   {
       #pragma omp for schedule(dynamic) nowait
       for (int i = 0; i < count; ++i)
       {
           /* code */
       }

       #pragma omp for schedule(dynamic)
       for (int j = 0; j < count2; ++j)
       {
           /* code */
       }
   }

   #pragma offload wait(s1)
   {
       /* code */
   }

The code offloads the calculation of x to the MIC and keeps the host busy by assigning OpenMP work to the CPU cores. The above code works as expected. However, the first offload pragma takes a lot of time and has become the bottleneck. Nevertheless, overall it pays off to offload the computation of x to the MIC. One way I am trying to overcome this latency issue is as follows:

   int s1 = f(a,b,c);

   #pragma omp parallel default(shared)
   {
       #pragma omp single nowait
       {
           #pragma offload signal(s1) in(...) out(x:len)
           {
               for (int i = 0; i < len; ++i)
               {
                   x[i] = ...
               }
           }
       }

       #pragma omp for schedule(dynamic) nowait
       for (int i = 0; i < count; ++i)
       {
           /* code */
       }

       #pragma omp for schedule(dynamic)
       for (int j = 0; j < count2; ++j)
       {
           /* code */
       }
   }

   #pragma offload wait(s1)
   {
       /* code */
   }

So this new code assigns one thread to do the offload while the other OpenMP threads can be used for the other worksharing constructs. However, this code doesn't work. I get the following error message:

device 1 does not have a pending signal for wait(0x1)

The offload report points to the above piece of code as the main culprit. One temporary workaround is to use a constant as the signal, i.e. signal(0), which works. However, I need a more permanent solution. Can anyone shed light on what is going wrong in my code?

Thanks

Ravi_N_Intel · ‎04-24-2014

The signal is associated with the thread that issued the offload, so if some other thread executes the wait, the signal is not available. Use #pragma omp master for both offload/signal and offload/wait.

The first offload does take time, as it has to load the binary and all the dependent libraries onto the MIC and do some setup. You can make this happen when the host program is loaded by setting the environment variable OFFLOAD_INIT=on_start
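To make the thread-identity point concrete, here is a minimal sketch in plain OpenMP, with no offload involved. It only shows which host thread runs a master block versus a single block; it does not model the offload runtime. The function names are hypothetical, and the _OPENMP guard is there so the code also builds serially.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Id of the calling thread; 0 when built without OpenMP. */
int current_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* A master block is always executed by thread 0 of the team. */
int thread_running_master(void)
{
    int id = -1;
    #pragma omp parallel default(shared)
    {
        #pragma omp master
        {
            id = current_thread();
        }
    }
    return id;
}

/* A single block is executed by whichever thread reaches it first,
 * so the returned id can differ from run to run. */
int thread_running_single(void)
{
    int id = -1;
    #pragma omp parallel default(shared)
    {
        #pragma omp single nowait
        {
            id = current_thread();
        }
    }
    return id;
}
```

Code that follows the parallel region runs on the initial thread, which is thread 0 inside the region. So an offload issued from a single block may belong to a different thread than a wait placed after the region.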

piyush_s_ · ‎04-24-2014

Thanks for your response. I will try #pragma omp master and let you know.

The code given above is inside a sequential for loop, so "#pragma offload" is executed at each iteration of that loop. All the buffers used on the MIC are reused on each offload (since mallocs are very costly).

On average, each asynchronous offload call costs me around 4-5 ms (on a dual-socket Westmere system). In the actual code the offload is autotuned, i.e. it offloads only if the model predicts that offloading will be beneficial. This latency of 4-5 ms limits the amount of computation that I can do on the MIC, so I am trying to hide it by merging the offload into OpenMP worksharing constructs.
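The break-even reasoning here can be sketched as a small cost model. This is only an illustration: the struct, the function names and the idea that independent host work hides the offload cost one-for-one are assumptions, not the actual autotuner from the post.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost estimate; all times are in milliseconds. */
typedef struct {
    double host_ms;    /* predicted time to compute on the host */
    double mic_ms;     /* predicted time to compute on the coprocessor */
    double launch_ms;  /* fixed cost of issuing the offload (4-5 ms in the post) */
    double overlap_ms; /* independent host work that can run in the meantime */
} offload_estimate;

/* Time the host is still stalled after overlapping other work. */
double exposed_offload_ms(offload_estimate e)
{
    double total = e.launch_ms + e.mic_ms;
    return total > e.overlap_ms ? total - e.overlap_ms : 0.0;
}

/* Offload only when the exposed cost beats computing on the host. */
bool should_offload(offload_estimate e)
{
    return exposed_offload_ms(e) < e.host_ms;
}
```

With a 5 ms launch cost and 3 ms of coprocessor time, a 6 ms host computation is not worth offloading when nothing overlaps (8 ms exposed). It becomes worthwhile once 4 ms of other host work can run concurrently (4 ms exposed).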

piyush_s_ · ‎04-24-2014

I tried using #pragma omp master for both the offload call and the wait, but I still get the same error message :(

device 1 does not have a pending signal for wait(0x1)

jimdempseyatthecove · ‎04-25-2014

Try this:

int s1 = f(a,b,c);
#pragma omp parallel default(shared)
{
  int s2=0; // new signal
  #pragma omp single nowait
  {
    s2=1; // my thread got the single
    // note: signal takes the address of s2
    #pragma offload signal(&s2) in(...) out(x:len)
    {
      for (int i = 0; i < len; ++i)
      {
        x[i] = ...
      }
    }
  }
  #pragma omp for schedule(dynamic) nowait
  for (int i = 0; i < count; ++i)
  {
    /* code */
  }
  #pragma omp for schedule(dynamic)
  for (int j = 0; j < count2; ++j)
  {
    /* code */
  }
  if(s2)
  {
    // my thread performed the offload
    #pragma offload wait(&s2)
    {
      /* code */
    }
  }
} // end parallel, implicit barrier

Jim Dempsey

jimdempseyatthecove · ‎04-25-2014

In your original code you used the value of s1 as opposed to the address of s1 as used in the examples. The documentation states that signal takes a unique integer value; it does not state whether some values are reserved. &s1 (address of s1) may have worked in your original code due to it being greater than the last address of the guard page which starts at 0. Had f(a,b,c) produced one of the reserved values, this too may have caused the issue you observed.
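As a rough illustration of the uniqueness argument only: tags built from computed values can collide, while addresses of distinct live variables cannot. The f below is a hypothetical stand-in for the poster's f(a,b,c), which is not shown in the thread, and nothing here says which values, if any, the offload runtime reserves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the poster's f(a,b,c). */
int f(int a, int b, int c)
{
    return (a + b + c) % 4;
}

/* Tags taken from values: two different calls can yield the same tag. */
int value_tags_collide(void)
{
    int s1 = f(1, 2, 2); /* 5 % 4 == 1 */
    int s2 = f(4, 0, 1); /* 5 % 4 == 1 */
    return (uintptr_t)s1 == (uintptr_t)s2;
}

/* Tags taken from addresses: distinct live objects never share one. */
int address_tags_collide(void)
{
    int s1 = f(1, 2, 2);
    int s2 = f(4, 0, 1);
    return (uintptr_t)&s1 == (uintptr_t)&s2;
}
```

Note that the failing wait in the error message reports the tag as 0x1, i.e. a small integer value rather than an address.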

Jim Dempsey

piyush_s_ · ‎04-25-2014

Here is an example that gives the gist of what I am trying to do. I compile and run it with

icc -O0 -o offload_test -openmp -std=c99 offload_test.c

./offload_test

#include <stdio.h>
#include <assert.h>
#include <omp.h>
#define ONCE alloc_if(1) free_if(1)
#define N 100

int main()
{
	printf("Offload testing 101\n");
	double A[N],B[N],C[N];

	for (int i = 0; i < N; ++i)
	{
		A[i] = i;
		B[i] = i*i;
	}

	int s1 = 1;

	#pragma omp parallel default(shared)
	{
		#pragma omp master
		{
			printf("Master thread is %d\n",omp_get_thread_num());
			#pragma offload target(mic:1) signal(s1) in(A[0:N] :ONCE) in(B[0:N] : ONCE) out(C[0:N] : ONCE)
			{
				for (int i = 0; i < N; ++i)
				{
					C[i] = A[i]+B[i];
				}
			}
		}
		#pragma omp for
		for (int i = 0; i < N; ++i)
		{
			B[i] -= A[i]*A[i]-1;
		}
	}

	#pragma offload target(mic:1) wait(s1)
	{

	}

	for (int i = 0; i < N; ++i)
	{
		assert(C[i]== A[i]*(A[i]+1));
	}
	printf("Returned successfully\n");
	return 0;
}

Ravi_N_Intel · ‎04-25-2014

You have a race condition in your program. When the master thread is transferring B the other threads are modifying B. Try the following program:

#include <stdio.h>
#include <assert.h>
#include <omp.h>
#define ONCE alloc_if(1) free_if(1)
#define N 100

int main()
{
	printf("Offload testing 101\n");
	double A[N],B[N],C[N];

	for (int i = 0; i < N; ++i)
	{
		A[i] = i;
		B[i] = i*i;
	}

	int s1 = 1;

	#pragma omp parallel default(shared)
	{
		#pragma omp master
		{
			printf("thread id is %d\n",omp_get_thread_num());
			#pragma offload target(mic:1) signal(&s1) in(A[0:N] :ONCE) in(B[0:N] : ONCE) out(C[0:N] : ONCE)
			{
				for (int i = 0; i < N; ++i)
				{
					C[i] = A[i]+B[i];
				}
			}
		}
		#pragma omp barrier
		#pragma omp for
		for (int i = 0; i < N; ++i)
		{
			B[i] -= A[i]*A[i]-1;
		}
		#pragma omp master
		{
			#pragma offload_wait target(mic:1) wait(&s1)
		}
	}

	for (int i = 0; i < N; ++i)
	{
		assert(C[i]== A[i]*(A[i]+1));
	}
	printf("Returned successfully\n");
	return 0;
}
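To see why the order of the transfer and the host update matters, here is a serial sketch that replays the two possible orderings deterministically. It uses no offload at all: device_add, host_update and count_correct are hypothetical helpers, and a memcpy into B_dev stands in for the transfer of B to the coprocessor.

```c
#include <assert.h>
#include <string.h>

#define N 100

/* Stand-in for the offloaded loop, working on its own copy of B. */
void device_add(const double *a, const double *b_dev, double *c)
{
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b_dev[i];
}

/* The host loop from the test program. */
void host_update(const double *a, double *b)
{
    for (int i = 0; i < N; ++i)
        b[i] -= a[i] * a[i] - 1;
}

/* Replays one ordering and counts elements with C[i] == A[i]*(A[i]+1). */
int count_correct(int transfer_before_update)
{
    double A[N], B[N], C[N], B_dev[N];
    int ok = 0;

    for (int i = 0; i < N; ++i) {
        A[i] = i;
        B[i] = i * i;
    }
    if (transfer_before_update) {
        memcpy(B_dev, B, sizeof B);  /* B reaches the device unmodified */
        host_update(A, B);
    } else {
        host_update(A, B);           /* host wins the race */
        memcpy(B_dev, B, sizeof B);  /* device sees the modified B */
    }
    device_add(A, B_dev, C);
    for (int i = 0; i < N; ++i)
        ok += (C[i] == A[i] * (A[i] + 1));
    return ok;
}
```

When the transfer completes first, all 100 elements pass the check. When the host update lands first, every B[i] becomes 1, so C[i] is i+1 and only i = 1 still passes. In the real program the outcome can fall anywhere between these two extremes.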

piyush_s_ · ‎04-25-2014

Thanks. It does solve the above problem.