Software Archive
Read-only legacy content

Please help me with this code segment. It runs a lot slower when offloaded to the MIC.

王__云飞
Beginner
632 Views

Code:

#pragma offload target(mic) in(icf_1:length(row * column) alloc_if(1) free_if(0)) nocopy(buff:length(row) alloc_if(1) free_if(0)) nocopy(icfa_1:length(icfa_dim * (icfa_dim + 1) / 2) alloc_if(1) free_if(0)) nocopy(d:length(row) alloc_if(1) free_if(0))
{}

for (; step < parameter.max_iter; ++step) {

    // Update the array d
    ...............

    // Code segment I need to parallelize on MIC

    #pragma offload target(mic) in(d:length(row) alloc_if(0) free_if(0))
    {}

    for (int i = 0; i < column; ++i) {

        #pragma offload target(mic) nocopy(icf_1) nocopy(d) nocopy(buff)
        {
            #pragma omp parallel for //num_threads(60)
            for (int p = 0; p < row; ++p) {
                buff[p] = d[p] * icf_1[i * row + p];
            }
        }

        for (int j = 0; j <= i; ++j) {

            #pragma offload target(mic) nocopy(icf_1) nocopy(buff) nocopy(icfa_1)
            {
                double tmp = 0;
                #pragma omp parallel for reduction(+:tmp) //num_threads(60)
                for (int p = 0; p < row; ++p) {
                    tmp += buff[p] * icf_1[j * row + p];
                }
                icfa_1[icfa_dim * j - j * (j + 1) / 2 + i] = tmp;
            }
        }
    }

    #pragma offload target(mic) out(icfa_1:length(icfa_dim * (icfa_dim + 1) / 2) alloc_if(0) free_if(0))
    {}

    // Code segment ends

    // Array icfa_1 is used later
    ...............
}

Explanation:

1. As you can see, the code segment sits inside a BIG FOR LOOP (for (; step < parameter.max_iter; ++step)), which runs 38 times in my case.

I measured the total running time of this code segment, and it is about 10 times the baseline time on the server.

2. Array sizes: row = 10k, column = icfa_dim = 1k.

3. Data transfer:

Array icf_1 does not change during the whole BIG FOR LOOP, so I IN it to the MIC once before the BIG FOR LOOP; wherever the code segment needs it, I use NOCOPY.

Array d is updated on every iteration of the BIG FOR LOOP, so I use NOCOPY to allocate space for it before the BIG FOR LOOP and IN it to the MIC just before the code segment.

Array buff stays on the MIC the whole time and is never needed on the CPU, so I use NOCOPY to allocate space for it before the BIG FOR LOOP.

Array icfa_1 is the result I want from this code segment, and it is used later. I use NOCOPY to allocate space for it before the BIG FOR LOOP; when I update it on the MIC I use NOCOPY, and when the code segment ends I OUT it to the CPU without freeing its space on the MIC. (A compact sketch of this whole pattern appears after this list.)

4. According to the -openmp-report and -vec-report output, the compiler claims that this loop:

#pragma omp parallel for //num_threads(60)
for (int p = 0; p < row; ++p) {
    buff[p] = d[p] * icf_1[i * row + p];
}

cannot be vectorized because it has a data dependency, and I have no idea why.

And for this loop:

#pragma omp parallel for reduction(+:tmp) //num_threads(60)
for (int p = 0; p < row; ++p) {
    tmp += buff[p] * icf_1[j * row + p];
}
the report says it is vectorized.
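
Here is the sketch of the data-transfer pattern mentioned in point 3 above (a compact outline only, using my variable names; the empty offload blocks only allocate or move data). The final free at the very end is not in my code yet, it is just the natural counterpart of the alloc_if(1) calls:

// before the BIG FOR LOOP: allocate persistent buffers on the card, copy icf_1 once
#pragma offload target(mic) in(icf_1:length(row * column) alloc_if(1) free_if(0)) nocopy(d:length(row) alloc_if(1) free_if(0)) nocopy(buff:length(row) alloc_if(1) free_if(0)) nocopy(icfa_1:length(icfa_dim * (icfa_dim + 1) / 2) alloc_if(1) free_if(0))
{}

for (; step < parameter.max_iter; ++step) {
    // each iteration: refresh d on the card, reusing the existing allocation
    #pragma offload target(mic) in(d:length(row) alloc_if(0) free_if(0))
    {}

    // ... the compute offloads use nocopy(...), so no data moves here ...

    // copy the result back, keeping the device buffer for the next iteration
    #pragma offload target(mic) out(icfa_1:length(icfa_dim * (icfa_dim + 1) / 2) alloc_if(0) free_if(0))
    {}
}

// after the loop: release the persistent buffers on the card
#pragma offload target(mic) nocopy(icf_1:length(row * column) alloc_if(0) free_if(1)) nocopy(d:length(row) alloc_if(0) free_if(1)) nocopy(buff:length(row) alloc_if(0) free_if(1)) nocopy(icfa_1:length(icfa_dim * (icfa_dim + 1) / 2) alloc_if(0) free_if(1))
{}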

Please help me. I have tried everything I can think of to figure out why this code segment fails to speed up when offloaded to the MIC. I really need to know where the problems are. Please give me some advice.

0 Kudos
1 Solution
5 Replies
Sunny_G_Intel
Employee
632 Views

Hi Brian,

For the part of the code that is not auto-vectorized, if you are certain there is no dependence, you can try hinting the compiler to ignore the assumed vector dependence with the pragma #pragma ivdep.
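
For example, a minimal sketch of where the hint would go in your first loop (illustrative only, keeping your variable names):

#pragma offload target(mic) nocopy(icf_1) nocopy(d) nocopy(buff)
{
    #pragma ivdep                       // ignore assumed vector dependences in the loop below
    #pragma omp parallel for
    for (int p = 0; p < row; ++p) {
        buff[p] = d[p] * icf_1[i * row + p];
    }
}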

Also, if you don't mind, can you please send me a private message (by clicking "Send author a message") with your code pasted in proper format? You can use the code-pasting option (on the main tab: "Add or update a code snippet").

0 Kudos
Sunny_G_Intel
Employee
632 Views

Hi Brian,

Thanks for sending me the code snippet.

I would like to point you to some optimization techniques that become crucial when dealing with OpenMP and the Intel Xeon Phi. Most importantly, I am referring to thread affinity control and efficient loop scheduling. Before going forward, we want to make sure we are not missing these common techniques:

Thread Affinity Control

OpenMP loop scheduling
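
For example (an illustrative sketch only; the exact values need tuning for your application), affinity and thread count are usually controlled with host environment variables such as KMP_AFFINITY=compact and OMP_NUM_THREADS (or their MIC_-prefixed variants for the offload side), while the schedule can be chosen per loop and deferred to run time:

#pragma offload target(mic) nocopy(icf_1) nocopy(d) nocopy(buff)
{
    // schedule(runtime) defers the choice to OMP_SCHEDULE, so different
    // schedules can be compared without recompiling
    #pragma omp parallel for schedule(runtime)
    for (int p = 0; p < row; ++p) {
        buff[p] = d[p] * icf_1[i * row + p];
    }
}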

0 Kudos
Sunny_G_Intel
Employee
632 Views

Hi Brian,

Please see the following changes (lines 359-396) in file pd_ipm.cc:

#pragma offload target(mic) in(d:length(row) alloc_if(0) free_if(0))
{}

for (int i = 0; i < column; ++i) {
    //offset += i;
    #pragma offload target(mic) nocopy(icf_1) nocopy(d) nocopy(buff)
    {
        // vdMul( row, d, &(icf_1)[i * row], buff );
        // #pragma omp parallel for schedule(dynamic) //num_threads(59)
        /*
        //ivdep --> ignores vector dependency.
        */
        #pragma ivdep
        #pragma omp parallel for //num_threads(59)
        for (int p = 0; p < row; ++p) {
            buff[p] = d[p] * icf_1[i * row + p];
        }
    }

    /*
    Pulled the offload section out of outer for loop
    */
    #pragma offload target(mic) nocopy(icf_1) nocopy(buff) nocopy(icfa_1)
    {
        for (int j = 0; j <= i; ++j) {
            //double tmp = 0;
            double tmp = 0;
            /*
            For your application schedule=runtime outperforms
            */
            #pragma omp parallel for reduction(+:tmp) schedule(runtime) //num_threads(59)
            for (int p = 0; p < row; ++p) {
                //tmp += buff[p] * icf_1[j * row + p];
                tmp += buff[p] * icf_1[j * row + p];
            }
            icfa_1[icfa_dim * j - j * (j + 1) / 2 + i] = tmp;
            //icfa[i-j] = tmp + (i == j ? 1 : 0);
            //result[offset+j] = tmp;
        }
    }
}

 

For your application:

Try to set some of these MIC environment variables:

export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=compact
export PHI_OMP_NUM_THREADS=120

Keep an eye on the core utilization on the card and try to find a sweet spot for your application. 

When compiling the program, make sure auto-vectorization is turned on (it is on by default at -O3 optimization). Also verify that your loops are vectorized wherever possible (-vec-report2).

Let me know if these changes help. They certainly did when I made them on my machine.

0 Kudos
Sunny_G_Intel
Employee
632 Views

Hi Brian,

I am surprised that the changes I suggested were completely ineffective. Can you please verify that you are running the updated code with the changes? Let me explain what the changes were:

Change 1: Moving the offload call out of the inner loop.

This certainly reduces the number of offload calls. You can verify it by checking the difference in your output (redirect the output to a temporary file after setting "export OFFLOAD_REPORT=2") with and without my suggested changes.

Change 2: Changing the OpenMP loop scheduling to "runtime".

I definitely see a difference in running time compared with "dynamic" scheduling.

Change 3: Using KMP_AFFINITY=compact.

This helps avoid false sharing in your application. You can read about false sharing here.

Change 4: Using 120 threads.

You mentioned in your code that you were trying to use 60 threads, but by default your offloaded code uses all 240 threads on the Xeon Phi. This sometimes results in over-utilization of the cores. Generally, for tasks like yours, using 120-180 threads should give good performance with compact affinity. You can see the number of threads being used by running micsmc and checking the card utilization.
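
If it helps, a small sketch like the following (my own illustration, not part of your code) prints how many threads an offloaded parallel region actually gets, so you can verify the thread count directly:

#include <omp.h>
#include <stdio.h>

void report_mic_threads() {
    #pragma offload target(mic)
    {
        #pragma omp parallel
        {
            // one thread reports the team size chosen on the coprocessor
            #pragma omp single
            printf("MIC OpenMP threads in use: %d\n", omp_get_num_threads());
        }
    }
}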

Change 5: Using ivdep with no OpenMP loop scheduling for the loop at line 370.

OpenMP "dynamic" loop scheduling may prevent the compiler from auto-vectorizing; you can see the difference by compiling with -vec-report2 with and without the changes.

Also, the for loop at line 386 cannot be auto-vectorized by the compiler because of the inherently sequential work it performs. A reduction-based OpenMP pragma can achieve parallelism here, but not vectorization. If you want vectorization here, you might have to divide the work into two parts:

the first part computing the products (buff[p] * icf_1[j * row + p]) using both vectorization and OpenMP, and the second part performing the reduction into the tmp variable.
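
A rough sketch of that two-part idea (prod is a scratch array introduced purely for illustration; it would need to be allocated on the card just like buff):

// part 1: elementwise products, vectorizable and parallel
#pragma omp parallel for
for (int p = 0; p < row; ++p) {
    prod[p] = buff[p] * icf_1[j * row + p];
}

// part 2: sum the products into tmp
double tmp = 0;
#pragma omp parallel for reduction(+:tmp)
for (int p = 0; p < row; ++p) {
    tmp += prod[p];
}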

Can you please also let me know your system configuration: the type of Xeon Phi card, your host system, and the memory used?

And perhaps any other optimizations you are using in the non-Xeon-Phi version.

0 Kudos
Sunny_G_Intel
Employee
633 Views

Hi Brian,

On further analyzing your application with Intel VTune Amplifier, I found that the OpenMP overhead was greater than the actual work done by the threads in parallel. So I made the following changes to your code: the first FOR loop now uses only vector-based parallelization (see line 369, commented out), and the granularity of the work done by the OpenMP threads in the second FOR loop was changed (the pragma on line 387 moved to line 380). Can you please try the following code revision and let me know if it helps?

357 #pragma offload target(mic) in(d:length(row) alloc_if(0) free_if(0))
358 {}
359 for (int i = 0; i < column; ++i) {
360     //offset += i;
361     #pragma offload target(mic) nocopy(icf_1) nocopy(d) nocopy(buff)
362     {
363         // vdMul( row, d, &(icf_1)[i * row], buff );
364         // #pragma omp parallel for schedule(dynamic) //num_threads(59)
365         /*
366         //ivdep --> ignores vector dependency.
367         */
368         #pragma ivdep
369         // #pragma omp parallel for //num_threads(59)
370         for (int p = 0; p < row; ++p) {
371             buff[p] = d[p] * icf_1[i * row + p];
372         }
373     }
374
375     /*
376     Pulled the offload section out of outer for loop
377     */
378     #pragma offload target(mic) nocopy(icf_1) nocopy(buff) nocopy(icfa_1)
379     {
380         #pragma omp parallel for schedule(runtime) //num_threads(59)
381         for (int j = 0; j <= i; ++j) {
382             //double tmp = 0;
383             double tmp = 0;
384             /*
385             For your application schedule=runtime outperforms
386             */
387             // #pragma omp parallel for reduction(+:tmp) schedule(runtime) //num_threads(59)
388             for (int p = 0; p < row; ++p) {
389                 //tmp += buff[p] * icf_1[j * row + p];
390                 tmp += buff[p] * icf_1[j * row + p];
391             }
392             icfa_1[icfa_dim * j - j * (j + 1) / 2 + i] = tmp;
393             //icfa[i-j] = tmp + (i == j ? 1 : 0);
394             //result[offset+j] = tmp;
395         }
396     }
397 }

 

0 Kudos