How to hybrid MIC and CPU without copy-and-paste

SKAL_H_
Beginner

Hello,

I know I can run a computation on the MIC and the CPU at the same time by using the asynchronous offload directive.

But I have one question: how can I do that without copying and pasting code?

For example, take a vector addition:

#pragma omp parallel for
for(i=0;i<N;i++) C[i] = A[i] + B[i];

I can make it hybrid like this:

#pragma offload target(mic:0) inout(C[0:N/2]: alloc(C[0:N/2])) signal(&sig)
{
#pragma omp parallel for
for(i=0;i<N/2;i++) C[i] = A[i] + B[i];
}

#pragma omp parallel for
for(i=N/2;i<N;i++) C[i] = A[i] + B[i];

#pragma offload_wait target(mic:0) wait(&sig)

But is there a way to do this without copying and pasting the loop?

3 Replies
Andrey_Vladimirov
New Contributor III

You can use the conditional offload clause for that: add "if(condition)" to your #pragma offload. If "condition" evaluates to true, the offload occurs; if it evaluates to false, the same code runs on the CPU instead.

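[cpp]
const int nDevices = _Offload_number_of_devices() + 1; // Offload to all coprocessors and one host
const int chunkSize = N/nDevices;
assert(N%nDevices == 0); // Consider only nice cases for simplicity
for (int d = 0; d < nDevices; d++) {
  const int start = d*chunkSize; // Start of work chunk
  const int finish = (d+1)*chunkSize; // End of work chunk
  // Offload to mic0, mic1, etc; for d=nDevices-1, do not offload and run on the host.
  // Array sections are written [start:length]; A, B and C are assumed to be pointers.
  // &C[start] serves as a signal tag that is unique to each chunk.
  #pragma offload target(mic: d) in(A[start:chunkSize], B[start:chunkSize]) inout(C[start:chunkSize]) if(d<nDevices-1) signal(&C[start])
  { // This runs either on one of the coprocessors, or on the host.
    #pragma omp parallel for
    for (int i = start; i < finish; i++)
      C[i] = A[i] + B[i];
  }
}
// Wait for the asynchronous offloads to finish before using C
for (int d = 0; d < nDevices-1; d++) {
  #pragma offload_wait target(mic: d) wait(&C[d*chunkSize])
}
[/cpp]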
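The snippet above only handles the case where N divides evenly among the devices. As a rough sketch of one way to lift that restriction, the chunk bounds could be computed so that the first N % nChunks chunks each take one extra element. The helper name `chunk_bounds` is hypothetical and not part of any Intel API.

```c
#include <assert.h>

/* Hypothetical helper (not an Intel API): computes the half-open range
   [start, finish) of chunk d out of nChunks chunks covering N items.
   The first N % nChunks chunks receive one extra element each. */
void chunk_bounds(int N, int nChunks, int d, int *start, int *finish)
{
    assert(nChunks > 0 && d >= 0 && d < nChunks);
    const int base  = N / nChunks;  /* minimum chunk size */
    const int extra = N % nChunks;  /* number of chunks that get one more item */
    *start  = d * base + (d < extra ? d : extra);
    *finish = *start + base + (d < extra ? 1 : 0);
}
```

Inside the device loop, `start` and `finish` would then come from `chunk_bounds(N, nDevices, d, &start, &finish)`, and the length in the array sections would be `finish - start` rather than a fixed `chunkSize`.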

Andrey_Vladimirov
New Contributor III

I prefer to do that using blocking offloads from multiple threads. A blocked thread does not use many resources, so you can still schedule host work on all threads.

 

const int nDevices = _Offload_number_of_devices() + 1;
const int chunkSize = N/nDevices; // As before, assuming N is divisible by nDevices
const int hostThreads = omp_get_max_threads();
omp_set_nested(1);
#pragma omp parallel for schedule(dynamic, 1) num_threads(nDevices)
for (int d = 0; d < nDevices; d++) {
  const int start = d*chunkSize;
  const int finish = (d+1)*chunkSize;
  #pragma offload target(mic: d) if (d<nDevices-1)
  {
#ifndef __MIC__
    omp_set_num_threads(hostThreads);
#endif
    #pragma omp parallel for
    for (int  i = start; i < finish; i++) {
      // ... do compute-intensive work ...
    }
  }
}

This approach can be generalized to schedule work chunks between devices dynamically (i.e., where you have more work chunks than compute devices). That may be necessary for load balancing: the MICs should complete their work faster than the CPU, and you don't want them idling:

const int nDevices = _Offload_number_of_devices() + 1;
const int hostThreads = omp_get_max_threads();
omp_set_nested(1);
#pragma omp parallel for schedule(dynamic, 1) num_threads(nDevices)
for (int c = 0; c < nChunks; c++) { 
  // Iterating over chunks where nChunks is much greater than nDevices
  const int start = c*chunkSize;
  const int finish = (c+1)*chunkSize;
  const int d = omp_get_thread_num(); // Map threads to compute devices
  #pragma offload target(mic: d) if (d<nDevices-1)
  {
#ifndef __MIC__
    omp_set_num_threads(hostThreads);
#endif
    #pragma omp parallel for
    for (int  i = start; i < finish; i++) {
      // ... do compute-intensive work ...
    }
  }
}

TaylorIoTKidd
New Contributor I

What is your intent? If it is to execute part on the coprocessor and part on the host, then your code is reasonable.

If you just want to avoid writing the same code multiple times, you can use an offloaded function:

__attribute__((target(mic))) void func()
{
   ...
}

And then call it twice, once within an offloaded region and once on the host.
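Applied to the vector addition from the question, this could look roughly like the sketch below. The names `OFFLOAD_TARGET`, `vec_add_range` and `hybrid_vec_add` are made up for illustration, and the code assumes that coprocessor 0 exists and that the compiler defines `__INTEL_OFFLOAD` when offload support is enabled. With other compilers the offload pragmas are ignored and both halves simply run on the host.

```c
#include <assert.h>

/* Hypothetical macro: marks a function for both host and coprocessor builds
   when Intel offload support is enabled, and expands to nothing otherwise. */
#ifdef __INTEL_OFFLOAD
#define OFFLOAD_TARGET __attribute__((target(mic)))
#else
#define OFFLOAD_TARGET
#endif

/* The loop body is written once: C[i] = A[i] + B[i] for i in [start, finish). */
OFFLOAD_TARGET
void vec_add_range(const float *A, const float *B, float *C, int start, int finish)
{
    #pragma omp parallel for
    for (int i = start; i < finish; i++)
        C[i] = A[i] + B[i];
}

/* First half is offloaded asynchronously (assumes mic:0 exists),
   second half runs on the host, then the host waits for the offload. */
void hybrid_vec_add(const float *A, const float *B, float *C, int N)
{
    assert(A && B && C && N >= 0);
    const int half = N / 2;
    char sig = 0; /* its address is used as the signal tag */

    #pragma offload target(mic:0) in(A[0:half], B[0:half]) out(C[0:half]) signal(&sig)
    vec_add_range(A, B, C, 0, half);

    vec_add_range(A, B, C, half, N);

    #pragma offload_wait target(mic:0) wait(&sig)
    (void)sig; /* keeps the variable used when the pragmas are ignored */
}
```

The loop now lives in one place, so a change to the kernel is made once and picked up by both the offloaded call and the host call.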

If you want offloaded code to execute on a coprocessor if a coprocessor exists, or on the host if a coprocessor doesn't exist, then you only need the offload statement and don't need to repeat the code or use conditionals. The design of the offload statement is to work whether or not a coprocessor exists. If a coprocessor exists, the offload will execute on a coprocessor. If one doesn't exist, the offload code will execute on the host.
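As a small illustration of that last case, here is a sketch with a single offload region and no conditionals. The function name `offload_sum` is hypothetical, and the fallback to the host described above is taken from this reply rather than verified here. When compiled without offload support, the pragma is ignored and the loop runs on the host.

```c
#include <assert.h>

/* Hypothetical example: sums n floats inside one offload region.
   No if-clause and no duplicated loop: the same block is meant to run on a
   coprocessor when one is available, or on the host otherwise. */
float offload_sum(const float *x, int n)
{
    assert(x != 0 || n == 0);
    float total = 0.0f;

    #pragma offload target(mic) in(x[0:n]) inout(total)
    {
        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < n; i++)
            total += x[i];
    }
    return total;
}
```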

Regards

--
Taylor
