My code takes a lot of time to execute and returns an incorrect result

ankit_m_ · 01-12-2014

Hello,

I am new to programming with MIC cards. I am trying to run a very simple program, but it appears to take a long time to offload the data to the MIC card, and the final output seems to be incorrect. Can anyone help me find my mistake, please?

#include <iostream>
#include <memory>
#include "omp.h"
#include <malloc.h>

using namespace std;

int main()
{
   int xx=100000;
   int yy=10000;

   unsigned long long size = xx*yy;
   cout << " Simulate Data" << endl;
   cout << "data size " << size*4 << endl;

   int* aa = (int*) malloc(sizeof(int)*size);
   for(unsigned long long ii=0; ii < xx*yy; ++ii)
   {
       aa[ii] =1;
   }

   cout << " start offload " << endl;
   unsigned long long dim = xx*yy;
   #pragma offload target(mic:0) \
   in(aa:length(dim))
   {
       #pragma omp parallel for
       for (unsigned long long ii; ii < xx*yy; ++ii)
       {
           aa[ii] *= 2;
       }
   }

   cout << " offload end " << endl;
   cout << " Result " << aa[10] <<" " << aa[1000] << endl;
   free(aa);

   return 0;
}

Thank you

Sincerely,

AM

jimdempseyatthecove · 01-13-2014

You want to use inout rather than in, so that the modified array is copied back to the host after the offload:

#pragma offload target(mic:0) \
inout(aa:length(dim))

Note that the first offload carries the overhead of transferring the code and initializing the MIC's OpenMP thread pool. Try timing several offloads in a loop:

for(int I=0; I<4; ++I) {
   cout << " start offload " << endl;
   double t0 = omp_get_wtime();
   unsigned long long dim = xx*yy;
   // inout: copy aa to the MIC before the block and back to the host after it
   #pragma offload target(mic:0) \
   inout(aa:length(dim))
   {
       #pragma omp parallel for
       // the loop index must be initialized to 0
       for (unsigned long long ii = 0; ii < dim; ++ii)
       {
           aa[ii] *= 2;
       }
   }
   double t1 = omp_get_wtime();
   // the first iteration includes the one-time offload start-up cost
   cout << " offload end " << t1 - t0 << endl;
   cout << " Result " << aa[10] <<" " << aa[1000] << endl;
} // for

Jim Dempsey
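
For reference, a minimal sketch of the corrected kernel, assuming the Intel compiler's offload pragmas as used in this thread. The function name double_in_place is illustrative and does not come from the original posts. It uses inout so that the results come back to the host, and it initializes the loop index, which the program in the question declares without a starting value. The pragmas are guarded so that a compiler without offload or OpenMP support simply runs the loop on the host.

```cpp
#include <cassert>
#include <vector>

// Doubles every element of v in place.
// double_in_place is a hypothetical helper name used only for this sketch.
void double_in_place(std::vector<int>& v)
{
    int* aa = v.data();
    const long long n = static_cast<long long>(v.size());
    assert(aa != nullptr || n == 0);

    // inout: copy aa to the coprocessor before the block, and back afterwards.
    // With plain in(...), the doubled values would never reach the host.
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) inout(aa:length(n))
#endif
    {
#ifdef _OPENMP
        #pragma omp parallel for
#endif
        for (long long ii = 0; ii < n; ++ii)  // index explicitly starts at 0
        {
            aa[ii] *= 2;
        }
    }
}
```

Calling double_in_place on a vector filled with 1 should leave every element equal to 2, which is the result the original program was expected to print.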

ankit_m_ · 01-13-2014

Thank you very much for your prompt reply, Jim. I really appreciate all your help. My program now runs correctly; however, the offload is still too slow.

jimdempseyatthecove · 01-14-2014

Please note that the code within your offloaded section is trivial.

Read (vector), multiply (vector), write (vector).

That is all it is doing (other than a little loop overhead).

Your offloaded code should perform enough work to recover the time spent passing the data into and out of the MIC.

Choose something like a textbook matrix multiply as sample code.

Jim Dempsey
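
To illustrate the point about work per byte transferred: the vector kernel in this thread performs one multiply for every 8 bytes moved (4 in, 4 out), whereas an n x n matrix multiply performs about 2n^3 floating-point operations on 3n^2 elements, so the work per byte grows with n. The following is a rough sketch of a textbook matrix multiply, not tuned code. The function name matmul and the row-major layout are assumptions made for this example, and the guarded pragmas fall back to host execution when offload or OpenMP is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes c = a * b for n x n row-major matrices (textbook triple loop).
// matmul is a hypothetical name chosen for this sketch.
void matmul(const std::vector<float>& a, const std::vector<float>& b,
            std::vector<float>& c, int n)
{
    const std::size_t nn = static_cast<std::size_t>(n) * n;
    assert(a.size() == nn && b.size() == nn);
    c.resize(nn);

    const float* pa = a.data();
    const float* pb = b.data();
    float* pc = c.data();

    // Inputs travel to the coprocessor only; the result travels back only.
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) in(pa:length(nn)) in(pb:length(nn)) out(pc:length(nn))
#endif
    {
#ifdef _OPENMP
        #pragma omp parallel for
#endif
        for (int i = 0; i < n; ++i)
        {
            const std::size_t row = static_cast<std::size_t>(i) * n;
            // Zero the output row here, since out(...) data starts uninitialized.
            for (int j = 0; j < n; ++j)
                pc[row + j] = 0.0f;
            for (int k = 0; k < n; ++k)
            {
                const float aik = pa[row + k];
                const std::size_t brow = static_cast<std::size_t>(k) * n;
                for (int j = 0; j < n; ++j)
                    pc[row + j] += aik * pb[brow + j];
            }
        }
    }
}
```

Timing this with omp_get_wtime() for increasing n, as in the loop suggested earlier in the thread, should show the transfer cost becoming a smaller share of the total as n grows.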