Does MIC really run faster than CPU?

Sherry_L_ · ‎07-13-2017

Hi!

I compared the speed of the CPU and the MIC by running identical C++ programs using OpenMP (both fully occupied during the run). However, in the release build, the CPU (9.7 s) is nearly 3 times faster than the MIC (26.5 s). How come? If the MIC is actually slower than the CPU, then what is the point of using it?

The testing code is as follows:

#pragma omp parallel for reduction(+:sum)
for(int i=0; i<100000; i++)
    for(int j=0; j<100000; j++)
        sum += sqrt(sqrt(j^2+1) + sqrt(sqrt(i^2+1)) + 1);
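One detail in the loop above is worth checking before comparing timings: in C++, `^` is bitwise XOR, not exponentiation, and `+` binds tighter than `^`, so `j^2+1` evaluates as `j ^ 3`. The sketch below contrasts the two readings; the function names `as_posted` and `as_intended` are hypothetical and used only for illustration, and it assumes the intent was "j squared plus one".

```cpp
// What the posted expression j^2+1 computes: '+' binds tighter than '^',
// so this is j XOR 3, not j squared plus one.
int as_posted(int j) { return j ^ (2 + 1); }

// The presumably intended value. The multiply is done in double because
// 100000 * 100000 does not fit in a 32-bit int.
double as_intended(int j) { return static_cast<double>(j) * j + 1.0; }
```

For example, `as_posted(5)` yields 6 while `as_intended(5)` yields 26.0, so the two versions time different arithmetic.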

For the MIC, I used the offload pragma to run the code.

The MIC I used is:

Intel Xeon Phi Coprocessor 7120

The CPU I used is:

Genuine Intel(R) CPU @ 1.80GHz 1.80GHz (2 processor)

Hopefully someone can tell me the reason.

jimdempseyatthecove · ‎07-13-2017

The MIC is designed for highly threaded wide vector operations. Try this:

// the following is inside a function callable from host or from inside an offload
// (needs <cmath>, <iostream>, and <omp.h>)
void test()
{
const int N = 100000;

double* array = new double[N];
double sum = 0.0;

// populate outside of timed region
#pragma omp parallel for
for(int j=0; j<N; ++j)
  array[j] = j;

// Note, at this time the OpenMP thread team has been created
// ... Outside of the timed region (because this is a once-only thing)

// Start timed region
double tStart = omp_get_wtime();
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<N; i++) {
    for(int j=0; j<N; j++)
        sum += sqrt(sqrt(array[j]*array[j]+1.0) + sqrt(sqrt(array[i]*array[i]+1.0)) + 1.0);
}
double tEnd = omp_get_wtime();
std::cout << "Runtime = " << tEnd - tStart << std::endl;
delete [] array;
}

Call test() from the host, and from inside an offload.
Note that the first time an offload is performed, the executable portion of the offload is slurped into the coprocessor. The above removes that overhead from the timed region.

Also note that the first parallel region, on the host and on the coprocessor, incurs the overhead of instantiating the OpenMP thread pool. The above removes that overhead from the timed region.

The compute loop was changed so that it incorporates vector operations (the type of code generally run on MIC).

Jim Dempsey
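The warm-up idea described above can be sketched as a small helper that is called twice: the first call absorbs one-time costs such as thread-pool creation, and only the second call is reported. This is a minimal sketch, not the code from the post; `timed_pass` is a hypothetical name, and it uses `std::chrono` instead of `omp_get_wtime()` so that it also builds when OpenMP is disabled (the pragma is then ignored).

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical helper: runs one parallel reduction over `a`, stores the
// result in `sum_out`, and returns the elapsed wall-clock time in seconds.
double timed_pass(const std::vector<double>& a, double& sum_out) {
    const long n = static_cast<long>(a.size());
    double sum = 0.0;  // initialized before the reduction

    const auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += std::sqrt(a[i] * a[i] + 1.0);
    const auto t1 = std::chrono::steady_clock::now();

    sum_out = sum;
    return std::chrono::duration<double>(t1 - t0).count();
}
```

A typical use would be to call `timed_pass` once and discard the result, then call it again and report that second time. Returning the sum to the caller also keeps the compiler from discarding the loop as dead code.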

TimP · ‎07-13-2017

sum needs initialization prior to the OpenMP reduction. In such an oversimplified test the compiler has opportunities for shortcuts. sqrt is not among the more efficient operations on MIC KNC (Knights Corner).
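To illustrate the initialization point: with `reduction(+:sum)`, each thread accumulates into a private copy that starts at zero, and the partial results are then added to the value `sum` held before the loop. An uninitialized `sum` therefore leaves the final result undefined. A minimal sketch, with the hypothetical function name `reduce_from`:

```cpp
// Hypothetical example: sums 1..n on top of a caller-supplied starting value.
double reduce_from(double initial, int n) {
    double sum = initial;  // must hold a defined value before the parallel loop

    // Each thread's private copy of sum starts at 0; the partial sums are
    // combined with the original value of sum when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += static_cast<double>(i);

    return sum;
}
```

Here `reduce_from(0.0, 100)` gives 5050.0 and `reduce_from(10.0, 100)` gives 5060.0, whether or not OpenMP is enabled.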

Sherry_L_ · ‎07-20-2017

It seems that putting all the data into an array does not actually help, which is rather confusing. I wonder whether my compilation options are not properly set. If so, how should I set them?

Moreover, after setting the optimization option to O2, I noticed a significant improvement on both the MIC and the CPU. For the MIC especially, the run time dropped from 28 s to 0.9 s, whereas the CPU only improved from 9 s to 4.5 s. Can you explain this? (After some experiments, it seems that this setting does not give similar improvements for other, more complicated programs.)

TimP · ‎07-20-2017

Even if your host CPU has 4 lanes, you are using only 2 if you select the default SSE2 ISA. On your MIC you may be comparing a 1-lane, full-accuracy library function call with no pipelining against 4 lanes of pipelined, inlined approximate code. If this is a homework problem meant to give you an incentive to study what is happening, you won't get the full answer without doing that. Did you pay attention to the number of threads?
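The lane counts mentioned above follow from simple arithmetic: lanes = vector width in bits / element size in bits. The sketch below works that out for double precision and also reports the OpenMP thread count, which is worth printing on both the host and the coprocessor. The helper names `lanes` and `available_threads` are hypothetical, and the 512-bit figure assumes the KNC vector unit.

```cpp
#include <cstddef>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of elements of `element_bytes` bytes that fit in one vector register.
constexpr std::size_t lanes(std::size_t vector_bits, std::size_t element_bytes) {
    return vector_bits / (8 * element_bytes);
}

// Threads an OpenMP parallel region would use by default; 1 if the code
// was compiled without OpenMP support.
int available_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

With `sizeof(double) == 8`, `lanes(128, 8)` is 2 (SSE2), `lanes(256, 8)` is 4 (AVX), and `lanes(512, 8)` is 8 (512-bit vectors). Whether the compiler actually uses those lanes depends on the ISA and optimization options selected.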