I am using Intel's offload pragmas in host OpenMP code. The code looks as follows:
int s1 = f(a,b,c);
#pragma offload signal(s1) in (...) out(x:len)
{
    for (int i = 0; i < len; ++i)
    {
        x[i] = ...
    }
}
#pragma omp parallel default(shared)
{
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < count; ++i)
    {
        /* code */
    }
    #pragma omp for schedule(dynamic)
    for (int j = 0; j < count2; ++j)
    {
        /* code */
    }
}
#pragma offload wait(s1)
{
    /* code */
}
The code offloads the calculation of x to the MIC. Meanwhile, the host keeps itself busy by assigning OpenMP work to the CPU cores. The above code works as expected. However, the first offload pragma takes a lot of time and has become the bottleneck. Nevertheless, overall it pays off to offload the computation of x to the MIC. One way I am trying to overcome this latency issue is as follows:
int s1 = f(a,b,c);
#pragma omp parallel default(shared)
{
    #pragma omp single nowait
    {
        #pragma offload signal(s1) in (...) out(x:len)
        {
            for (int i = 0; i < len; ++i)
            {
                x[i] = ...
            }
        }
    }
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < count; ++i)
    {
        /* code */
    }
    #pragma omp for schedule(dynamic)
    for (int j = 0; j < count2; ++j)
    {
        /* code */
    }
}
#pragma offload wait(s1)
{
    /* code */
}
So this new code assigns one thread to do the offload while the other OpenMP threads can be used for the other work-sharing constructs. However, this code doesn't work. I get the following error message:
device 1 does not have a pending signal for wait(0x1)
The offload report points to the above piece of code as the main culprit. One temporary workaround is to use a constant as the signal, i.e. signal(0), which works. However, I need a more permanent solution. Can anyone shed light on what is going wrong in my code?
Thanks
Thanks for your response. I will try #pragma omp master and let you know.
The code given above is inside a sequential for loop, and thus "#pragma offload" is executed at each iteration of this sequential loop. All the buffers used on the MIC are reused on each offload (since mallocs are very costly).
On average, each asynchronous offload call costs me around 4-5 ms (this is on a dual-socket Westmere system). In the actual code, offload is autotuned, i.e. it offloads only if the model predicts that offloading might benefit me. This latency of 4-5 ms limits the amount of computation that I can do on the MIC, so I am trying to hide this latency by merging the offload into the OpenMP work-sharing constructs.
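To make the trade-off concrete, here is a minimal sketch of the kind of decision such an autotuning model might make. It is not the actual model from the code described above: the struct `offload_model`, its fields, and the function `should_offload` are hypothetical names, and the per-item costs are assumed to be linear purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical linear cost model; every field is an assumption. */
typedef struct {
    double launch_latency_ms;  /* fixed cost per asynchronous offload, e.g. 4-5 ms */
    double host_ms_per_item;   /* assumed host cost per work item */
    double mic_ms_per_item;    /* assumed coprocessor cost per work item */
} offload_model;

/* Returns true when the predicted coprocessor time, including the fixed
   launch latency, is lower than the predicted host time. */
static bool should_offload(const offload_model *m, long items)
{
    assert(items >= 0);
    double host_ms = m->host_ms_per_item * (double)items;
    double mic_ms  = m->launch_latency_ms + m->mic_ms_per_item * (double)items;
    return mic_ms < host_ms;
}
```

With a fixed launch cost, small work sizes never reach the break-even point, which is why hiding the 4-5 ms inside the parallel region would widen the range of sizes worth offloading.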
I tried using #pragma omp master for both the offload call and the wait, but I still get the same error message :(
device 1 does not have a pending signal for wait(0x1)
Try this:
int s1 = f(a,b,c);
#pragma omp parallel default(shared)
{
    int s2 = 0; // new signal
    #pragma omp single nowait
    {
        s2 = 1; // my thread got the single
        // note: the address of s2 is used as the signal
        #pragma offload signal(&s2) in (...) out(x:len)
        {
            for (int i = 0; i < len; ++i)
            {
                x[i] = ...
            }
        }
    }
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < count; ++i)
    {
        /* code */
    }
    #pragma omp for schedule(dynamic)
    for (int j = 0; j < count2; ++j)
    {
        /* code */
    }
    if (s2)
    { // my thread performed the offload
        #pragma offload wait(&s2)
        {
            /* code */
        }
    }
} // end parallel, implicit barrier
Jim Dempsey
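The key idea in the suggestion above is the thread-private flag: only the thread that entered the single region later issues the wait. The sketch below isolates that pattern with the offload removed, so it can be checked on any compiler. The function name `count_single_owners` is made up for this illustration.

```c
/* Returns how many threads saw themselves as the owner of the single region.
   Built without OpenMP support, the pragmas are ignored and the one
   remaining thread is the owner. */
int count_single_owners(void)
{
    int owners = 0;                 /* shared across the team */

    #pragma omp parallel default(shared)
    {
        int did_single = 0;         /* private: declared inside the region */

        #pragma omp single nowait
        {
            did_single = 1;         /* the offload with signal(&did_single) would start here */
        }

        /* the work-sharing loops would run here on all threads */

        if (did_single)
        {                           /* only the owner would issue wait(&did_single) */
            #pragma omp atomic
            owners += 1;
        }
    }
    return owners;
}
```

Because exactly one thread executes a single region, the function should return 1 regardless of team size, which is what makes the flag safe to use as the guard around the wait.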
In your original code you used the value of s1 as opposed to the address of s1, as used in the examples. The documentation states that signal takes a unique integer value; it does not state whether some values are reserved. &s1 (the address of s1) may have worked in your original code because it is greater than the last address of the guard page, which starts at 0. Had f(a,b,c) produced one of the reserved values, this too may have caused the issue you observed.
Jim Dempsey
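As a rough sketch of the address-as-tag convention described above, the function below starts an asynchronous offload and waits on it from the same thread, using the address of a local variable as the signal. The function name, the choice of `mic:0`, and the `length(n)` modifiers are assumptions for illustration. On a compiler without Intel offload support the pragmas are ignored and the loop simply runs on the host.

```c
#include <assert.h>

/* Computes c[i] = a[i] + b[i]. With an offload-capable Intel compiler the
   loop is started asynchronously on the coprocessor; otherwise the unknown
   pragmas are ignored and this is plain host code. */
void add_async_then_wait(double *a, double *b, double *c, int n)
{
    assert(n >= 0);
    char tag = 0;   /* only its address matters; it is unique while in scope */

    #pragma offload target(mic:0) signal(&tag) in(a, b : length(n)) out(c : length(n))
    {
        for (int i = 0; i < n; ++i)
        {
            c[i] = a[i] + b[i];
        }
    }

    /* independent host work could overlap with the offload here */

    /* block until the offload tagged with &tag has completed */
    #pragma offload_wait target(mic:0) wait(&tag)
}
```

Using an address avoids having to guess which integer values are safe, since a live variable's address cannot collide with another live tag.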
Here is an example that gives the gist of what I am trying to do. I compile and run it with
icc -O0 -o offload_test -openmp -std=c99 offload_test.c
./offload_test
#include <stdio.h>
#include <assert.h>
#include <omp.h>
#define ONCE alloc_if(1) free_if(1)
#define N 100
int main()
{
    printf("Offload testing 101\n");
    double A[N], B[N], C[N];
    for (int i = 0; i < N; ++i)
    {
        A[i] = i;
        B[i] = i*i;
    }
    int s1 = 1;
    #pragma omp parallel default(shared)
    {
        #pragma omp master
        {
            printf("Master thread is %d\n",omp_get_thread_num());
            #pragma offload target(mic:1) signal(s1) in(A[0:N] : ONCE) in(B[0:N] : ONCE) out(C[0:N] : ONCE)
            {
                for (int i = 0; i < N; ++i)
                {
                    C[i] = A[i]+B[i];
                }
            }
        }
        #pragma omp for
        for (int i = 0; i < N; ++i)
        {
            B[i] -= A[i]*A[i]-1;
        }
    }
    #pragma offload target(mic:1) wait(s1)
    {
    }
    for (int i = 0; i < N; ++i)
    {
        assert(C[i] == A[i]*(A[i]+1));
    }
    printf("Returned successfully\n");
    return 0;
}
Thanks. It does solve the above problem.
