Reg: CPU + XeonPhi implementation of algorithm

shiva_rama_krishna_b · 09-09-2014

Hi,

I am trying to implement a load-balancing algorithm using CPU + Xeon Phi. The code snippet is below:

#define NUM_ITER 100
#define WORK_SIZE 10
.
.

for (int iter = 0; iter < NUM_ITER; iter++)
{
    #pragma omp parallel num_threads(2)
    {
        #pragma omp for
        for (int i = 0; i < WORK_SIZE; i++)
        {
            int threadId = omp_get_thread_num();
            if (threadId == 0)
            {
                /* process 'i' th element on CPU */
            }
            if (threadId == 1)
            {
                #pragma offload target(mic) in(inputDataToCompute-i)
                {
                    /* process 'i' th element on XeonPhi */
                }
            }
        }
    }
}


In the snippet above, each time the program reaches the parallel region it creates 2 threads. One thread processes its work units on the CPU and the other offloads its work units to the Xeon Phi. This repeats for a number of iterations.
I have a few general questions.
1) If two threads make offload calls, do both threads incur the initial overhead (the first-offload overhead), or only the one that calls offload first?
2) As I understand it, #pragma omp parallel num_threads() creates two new threads in every iteration (different from those of the previous iteration). Is that correct?
3) Will I have the offload overhead in every iteration, or just in the first iteration?


Could someone please help me?

Thanks

sivaramakrishna

Vincent_N_ · 09-09-2014

I'm curious which offload call you are using. Can it also be used in plain C/C++, without the OpenMP shown in the example?

Charles_C_Intel1 · 09-12-2014

The majority of the offload startup overhead will only happen on the first iteration. But if the arguments of your "in" clause are large, they will need to be allocated and transferred every iteration. You can eliminate the allocation cost by reusing the destination buffer on the Xeon Phi between iterations (look into the "nocopy" and "alloc_if"/"free_if" arguments of pragma offload), but the copy time will be hard to make go away.
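
A minimal sketch of the buffer-reuse idea, assuming the Intel compiler's offload extensions and a coprocessor addressed as mic:0. The function name, the buffer contents and the summation are hypothetical placeholders, not taken from the original code. A compiler without offload support will typically ignore the pragmas and run everything on the host.

```c
#include <stdlib.h>

/* Hypothetical example: sums a buffer on the coprocessor in every iteration
 * while allocating the card-side copy of the buffer only once. */
double offload_sum_reusing_buffer(int num_iter, int n)
{
    double *data = malloc((size_t)n * sizeof *data);
    double sum = 0.0;
    if (data == NULL)
        return -1.0;

    /* Allocate the card-side buffer once; nocopy means no data is sent yet. */
    #pragma offload_transfer target(mic:0) nocopy(data : length(n) alloc_if(1) free_if(0))

    for (int iter = 0; iter < num_iter; iter++) {
        /* Refill the host buffer with this iteration's input. */
        for (int j = 0; j < n; j++)
            data[j] = (double)iter;

        /* Copy the input into the existing card buffer: alloc_if(0) skips
         * the allocation and free_if(0) keeps the buffer for the next pass.
         * The transfer itself still happens in every iteration. */
        #pragma offload target(mic:0) in(data : length(n) alloc_if(0) free_if(0)) inout(sum)
        {
            for (int j = 0; j < n; j++)
                sum += data[j];
        }
    }

    /* Release the card-side buffer after the last iteration. */
    #pragma offload_transfer target(mic:0) nocopy(data : length(n) alloc_if(0) free_if(1))

    free(data);
    return sum;
}
```

Calling offload_sum_reusing_buffer(NUM_ITER, n) from the outer loop's position would replace a per-iteration allocate/copy/free cycle with a copy only.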

Charles