Parallel Code: Why is Intel Pentium faster than Intel Xeon Phi?

Munasinghe__Indula · ‎04-05-2020

Hello Everyone,

I have a CentOS 7.3 system with an Intel Xeon Phi 3120A coprocessor and an Intel Pentium Gold G5400 processor. I'm using the Intel compiler (ICC) to compile C code that uses OpenMP for parallel programming. I tested a simple program that calculates the value of pi on this system, but the Pentium processor at its maximum of 4 threads seems to be far faster than the coprocessor at its full capacity of 228 threads. I know the Pentium cores are faster than the Xeon Phi cores, but given the number of threads the Xeon Phi can provide, I still can't understand the reason for this difference.

The code I used is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static long num_steps = 100000;
double step;

int NUM_THREADS=228;
void main()
{
	#pragma offload target (mic:0)
	{
		int i,nthreads; double pi,sum[NUM_THREADS],t1,t2,time = 0.0;
		step = 1.0/(double)num_steps;
		t1 = omp_get_wtime();

			omp_set_num_threads(NUM_THREADS);	
			#pragma omp parallel
			{
				double x;
				int i;
				int ID = omp_get_thread_num();
				int nthrds = omp_get_num_threads();
				if(ID==0) nthreads = nthrds;
				for(i=ID, sum[ID]=0.0; i<num_steps; i=i+nthrds)
				{
                			x = (i+0.5)*step;
                			sum[ID] += 4.0/(1.0+x*x);
				}
			}

		for(i=0,pi=0.0;i<nthreads;i++)pi += sum[i]*step;
		t2 = omp_get_wtime();
		time = t2 - t1; 
		printf("pi value:(%f)\n",pi);
		printf("time spent:(%f)\n",time);
	}
}

I ran the code on the Pentium Gold processor with the offload pragma removed and got the following result:

[root@localhost codes]# icc -qopenmp para_pi_mic.c -o para_pi_mic
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.001684)

Then, on the coprocessor, I got the following result:

[root@localhost codes]# icc -qoffload -qopenmp para_pi_mic.c -o para_pi_mic
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.263586)

Here the time denotes the execution time for the parallelized code region.

Could you please explain what's happening here?
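A side note on the listing above, offered as a sketch rather than a diagnosis: the per-thread accumulators in sum[] sit next to each other in memory, so several threads may keep writing to the same cache line (false sharing). One common way around this is to pad each slot to a full cache line. The 64-byte line size, the MAX_THREADS bound, and the names padded_sum_t and pi_padded are assumptions made for this illustration; none of them come from the original post.

```c
#define CACHE_LINE 64     /* assumed cache-line size in bytes */
#define MAX_THREADS 256   /* assumed upper bound on the thread count */

/* One accumulator per thread, padded so that no two share a cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_sum_t;

/* Midpoint-rule estimate of pi using nthreads interleaved partial sums. */
double pi_padded(long num_steps, int nthreads)
{
    static padded_sum_t sum[MAX_THREADS];
    double step = 1.0 / (double)num_steps;
    double pi = 0.0;
    int t;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    /* Each iteration of this loop plays the role of one thread (ID == t). */
    #pragma omp parallel for num_threads(nthreads)
    for (t = 0; t < nthreads; t++) {
        long i;
        sum[t].value = 0.0;
        for (i = t; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            sum[t].value += 4.0 / (1.0 + x * x);   /* writes stay in this slot's own line */
        }
    }

    /* Serial combination of the partial sums, as in the original listing. */
    for (t = 0; t < nthreads; t++) pi += sum[t].value * step;
    return pi;
}
```

Calling pi_padded(100000, 228) mirrors the original setup. Whether padding makes a measurable difference on this hardware would have to be checked by timing both variants.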

ArthurRatz · ‎04-05-2020

Munasinghe, Indula, in your example you are oversubscribing with a very large number of threads.

Please make sure that you do the following:

1. Remove the call to the omp_set_num_threads(NUM_THREADS) function.

Here is the complete code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static long num_steps = 100000;
double step;

int NUM_THREADS=228;
void main()
{
	#pragma offload target (mic:0)
	{
		int i,nthreads; double pi,sum[NUM_THREADS],t1,t2,time = 0.0;
		step = 1.0/(double)num_steps;
		t1 = omp_get_wtime();

			//omp_set_num_threads(NUM_THREADS);	
			#pragma omp parallel
			{
				double x;
				int i;
				int ID = omp_get_thread_num();
				int nthrds = omp_get_num_threads();
				if(ID==0) nthreads = nthrds;
				for(i=ID, sum[ID]=0.0; i<num_steps; i=i+nthrds)
				{
                			x = (i+0.5)*step;
                			sum[ID] += 4.0/(1.0+x*x);
				}
			}

		for(i=0,pi=0.0;i<nthreads;i++)pi += sum[i]*step;
		t2 = omp_get_wtime();
		time = t2 - t1; 
		printf("pi value:(%f)\n",pi);
		printf("time spent:(%f)\n",time);
	}
}

2. Please build your code by using the commands below:

icc -qopenmp para_pi_mic.c -o para_pi_mic // CPU

icc -qopenmp -qoffload-arch=mic-avx512 -o para_pi_mic para_pi_mic.c // Intel Xeon Phi

That's all. Have a good day ahead!

Munasinghe__Indula · ‎04-05-2020

Hi Arthur,

Thank you for answering my question!

I am still facing some difficulties.

I removed the directives as you mentioned. When I tried to run the code on the Xeon Phi, it still ran on the Pentium Gold. The code runs on the Xeon Phi only when I keep #pragma offload target (mic:0).

Here's what happens when I build without the offload directive.

On the Pentium Gold:

[root@localhost codes]# icc -qopenmp -o  para_pi_mic para_pi_mic.c
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.335569)

When I try to execute it on the Xeon Phi, with the offload directive removed and -qoffload-arch=mic-avx512 added:

[root@localhost codes]# icc -qopenmp -qoffload-arch=mic-avx512 -o  para_pi_mic para_pi_mic.c
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.341854)

It still runs on the Pentium Gold. I observed the CPU usage while the code ran with a large number of steps, and all the processing happened on the Pentium Gold.

Then I kept the offload directive in the code and tried to use -qoffload-arch=mic-avx512 with it, and I got an error.

[root@localhost codes]# icc -qopenmp -qoffload-arch=mic-avx512 -o  para_pi_mic para_pi_mic.c
ld: warning: libcoi_device.so.0, needed by /opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5, not found (try using -rpath or -rpath-link)
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIPerfGetCycleFrequency@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIBufferAddRef@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIPipelineStartExecutingRunFunctions@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIEngineGetIndex@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIBufferReleaseRef@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIProcessWaitForShutdown@COI_1.0'

Could you tell me why this happens and what I need to do to get the modified code to run on the Xeon Phi?

ArthurRatz · ‎04-05-2020

Thanks for your reply and comment. I've got a couple of questions for you:

1. On what hardware platform did you initially run your code?

2. Is your code working on the Intel Xeon Phi, and did you receive the warnings listed above?

As far as I can tell, the Intel DevCloud does not offer Intel Xeon Phi co-processors, only Intel PAC / Intel ARRIA 10GX. So please don't run your code in the Intel DevCloud.

Also, uncomment the omp_set_num_threads(...) call and try to build and run the code regardless of whether the icc compiler gives the warning. Also run your code on-premises, locally on your own hardware.

As I've already explained, the only issue is that you oversubscribe with too many threads. Just remove the omp_set_num_threads(NUM_THREADS) function call.

Finally, once the code has been tested, please get back to me. I'd like to know the result.

Arthur.

Munasinghe__Indula · ‎04-05-2020

Hi Arthur,

I did not observe any change in performance when I removed omp_set_num_threads(NUM_THREADS);

As I said in my previous reply, I can't use -qoffload-arch=mic-avx512 together with #pragma offload target (mic:0), because it gives the error shown there.

The Pentium Gold still performs better than the Xeon Phi.

Pentium gold

[root@localhost codes]# icc -qopenmp -o  para_pi_mic para_pi_mic.c
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.063168)

Xeon Phi

[root@localhost codes]# icc -qopenmp -qoffload -o  para_pi_mic para_pi_mic.c
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.440111)

ArthurRatz · ‎04-05-2020

Please try this code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static long num_steps = 100000;
double step;

int NUM_THREADS = 228;
int main()
{
	#pragma offload target (mic:0)
	{
		double pi = 0.0, x, t1, t2, time = 0.0;
		step = 1.0 / (double)num_steps;
		t1 = omp_get_wtime();
        
		#pragma omp parallel for simd private(x) shared(step) reduction(+:pi)
		for (int i = 0; i < num_steps; i++)
		{
			x = (i + 0.5) * step;
			pi += 4.0 / (1.0 + x * x) * step;
		}
        
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n", pi);
		printf("time spent:(%f)\n", time);
	}

	return 0;
}

and test its performance. Also please get back to me with your reply.

Arthur.

ArthurRatz · ‎04-05-2020

What I've done is replace the manual work-sharing inside the OpenMP parallel region of your code with a parallelized tight loop. I'd like you to check whether this speeds up your code on the Intel Xeon Phi co-processor.

Arthur.

ArthurRatz · ‎04-05-2020

Also, I recommend increasing the value of num_steps to 1000000 or 10000000 and then testing the performance on both the Pentium and the Intel Xeon Phi.

With a smaller number of steps (e.g. 100000), the thread scheduling overhead probably dominates.

Arthur.
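To make the point about small step counts concrete, here is a minimal sketch that uses OpenMP's if clause so the loop is only run by multiple threads when there is enough work. The cut-off PARALLEL_THRESHOLD and the function name pi_midpoint are hypothetical; a sensible threshold would have to be found by measurement on the target machine.

```c
/* Hypothetical cut-off below which the loop is run by a single thread. */
#define PARALLEL_THRESHOLD 1000000L

/* Midpoint-rule estimate of pi with num_steps slices. */
double pi_midpoint(long num_steps)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    long i;

    /* The if(...) expression is evaluated at run time; when it is false,
       the parallel region is executed by a team of one thread. */
    #pragma omp parallel for reduction(+:sum) if(num_steps >= PARALLEL_THRESHOLD)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}
```

With this shape, pi_midpoint(100000) stays on one thread while pi_midpoint(10000000) uses the full team, so the two cases can be timed with the same binary.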

Munasinghe__Indula · ‎04-05-2020

Hi Arthur,

Thank you for your reply!

1. The system I ran the code on has an Intel Pentium Gold G5400 processor and 4 GB of RAM on an Asus Z370P motherboard. The coprocessor is an Intel Xeon Phi 3120A. It's a personal computer, so I run the code locally.

2. Yes, the code runs well on the coprocessor. I observed through micsmc-gui that all the cores were being used while the code ran on it.

Yes, I can run the compiler with the warnings, but no binary is generated when I use -qoffload-arch=mic-avx512 together with #pragma offload target (mic:0).

When I remove the offload directive #pragma offload target (mic:0) from the code, the compiler gives no warning with -qoffload-arch=mic-avx512 and the binary is generated successfully. But the code runs only on the Pentium Gold processor. I printed the thread count to see which processor it runs on, and the thread count is 4.
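Besides printing the thread count, another way to see where an offloaded block executed is to test the __MIC__ macro, which the Intel compiler defines when it builds the coprocessor side of the code. The sketch below is a minimal illustration under two assumptions: that the scalar on_mic is copied back to the host after the block (the default handling for scalars, as far as I know), and that the helper name offload_ran_on_mic is made up for this example.

```c
/* Returns 1 if the offloaded block ran on the coprocessor, 0 if it ran on the host. */
int offload_ran_on_mic(void)
{
    int on_mic = 0;

    #pragma offload target(mic:0)
    {
#ifdef __MIC__
        on_mic = 1;   /* compiled only into the coprocessor version of this block */
#else
        on_mic = 0;   /* host version, used when nothing is actually offloaded */
#endif
    }
    return on_mic;
}
```

A build without the offload pragma, or with a compiler that ignores it, returns 0, which matches the observation above that the program stayed on the Pentium Gold.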

Munasinghe__Indula · ‎04-05-2020

Thank you for your reply!

I will test the code you posted and let you know. The reply above answers your earlier questions.

ArthurRatz · ‎04-05-2020

Hi Munasinghe, Indula, I'll be waiting for your reply; I'm interested to know whether you've observed any speed-up on the Intel Xeon Phi.

Arthur.

ArthurRatz · ‎04-05-2020

Munasinghe, Indula wrote:
2. Yes, the code runs well on the coprocessor. And I observed through micsmc-gui that all the cores were being used when the code runs on it.

One more question: do you mean that the following code runs well on the co-processor?

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static long num_steps = 100000;
double step;

int NUM_THREADS = 228;
int main()
{
	#pragma offload target (mic:0)
	{
		double pi = 0.0, x, t1, t2, time = 0.0;
		step = 1.0 / (double)num_steps;
		t1 = omp_get_wtime();
        
		#pragma omp parallel for simd private(x) shared(step) reduction(+:pi)
		for (int i = 0; i < num_steps; i++)
		{
			x = (i + 0.5) * step;
			pi += 4.0 / (1.0 + x * x) * step;
		}
        
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n", pi);
		printf("time spent:(%f)\n", time);
	}

	return 0;
}

Is this code running well on the co-processor?

ArthurRatz · ‎04-05-2020

Also, I've tested the code from my answer on the CPU only. Here are some results:

num_steps = 100000 => execution time = 0.035316

num_steps = 1000000000 => execution time = 0.009555 (which is ~ 3.7x faster)

I expect to observe the same behavior on the Intel Xeon Phi co-processors.

Please note that parallel execution is of little use for a small number of iterations (e.g. a small num_steps in this particular case).

Thanks, Arthur.

Munasinghe__Indula · ‎04-05-2020

Hi Arthur,

I ran the code you gave me. It runs well on the coprocessor. The code follows. I made a small change to your code by moving int i outside the parallel section, since the compiler gave an error (undefined variable). It becomes a global variable shared by all threads because of that, right?

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static long num_steps = 10000000000;
double step;

int NUM_THREADS = 4;
int main()
{
	//#pragma offload target (mic:0)
	{
		double pi = 0.0, x, t1, t2, time = 0.0;
		int i;
		step = 1.0 / (double)num_steps;
		t1 = omp_get_wtime();
        
		#pragma omp parallel for simd private(x) shared(step) reduction(+:pi)
		for (i = 0; i < num_steps; i++)
		{
			x = (i + 0.5) * step;
			pi += 4.0 / (1.0 + x * x) * step;
		}
        
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n", pi);
		printf("time spent:(%f)\n", time);
	}

	return 0;
}
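On the question of whether moving int i outside the loop makes it shared: in OpenMP the index of the loop associated with a parallel for construct is not shared between threads, even when it is declared outside the construct. The sketch below checks the visible consequence, namely that every iteration still runs exactly once. The function name each_iteration_runs_once and the hits array are made up for this illustration.

```c
#include <stdlib.h>

/* Runs a parallel loop whose index is declared outside the loop and
   returns 1 if every iteration 0..n-1 was executed exactly once. */
int each_iteration_runs_once(int n)
{
    int i;                       /* declared outside, as in the listing above */
    int ok = 1;
    int *hits = calloc((size_t)n, sizeof *hits);
    if (hits == NULL) return 0;

    /* Each thread works on its own copy of i for its share of the iterations. */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        hits[i] += 1;            /* a distinct element per iteration, so no race */
    }

    /* Serial check: a shared, racing index would skip or repeat iterations. */
    for (i = 0; i < n; i++) {
        if (hits[i] != 1) ok = 0;
    }
    free(hits);
    return ok;
}
```

The variable x in the listing is a different case: it is written inside the loop body, which is why it needs the explicit private(x) clause.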

The following is the performance of the code on the Pentium and the Xeon Phi.

Xeon Phi Coprocessor

step_count = 100,000

[root@localhost codes]# icc -qopenmp -qoffload para_pi_mic_arthur.c -o para_pi_mic_arthur
[root@localhost codes]# ./para_pi_mic_arthur
pi value:(3.141593)
time spent:(0.272446)

step_count = 1,000,000

pi value:(3.141593)
time spent:(0.272476)

step_count = 10,000,000

pi value:(3.141593)
time spent:(0.273506)

step_count = 100,000,000

pi value:(3.141593)
time spent:(0.362132)

step_count = 1,000,000,000

pi value:(3.141593)
time spent:(0.418968)

step_count = 10,000,000,000 (The answer for pi value changed)

pi value:(0.560332)
time spent:(0.448750)

Pentium Gold

step_count = 100,000

pi value:(3.141593)
time spent:(0.002026)

step_count = 1,000,000

pi value:(3.141593)
time spent:(0.003813)

step_count = 10,000,000

pi value:(3.141593)
time spent:(0.018445)

step_count = 100,000,000

pi value:(3.141593)
time spent:(0.050985)

step_count = 1,000,000,000

pi value:(3.141593)
time spent:(0.343882)

step_count = 10,000,000,000 (The answer for pi value changed)

pi value:(0.560332)
time spent:(0.478111)

So the Pentium processor is still much faster than the Xeon Phi, right?
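One way to read these numbers, offered as a back-of-the-envelope sketch: the Xeon Phi time barely moves between 100,000 and 10,000,000 steps, which suggests that a fixed cost of roughly 0.27 s dominates those runs. Separating that fixed part from the per-step part gives a different picture of the two devices. The helper names below are made up; the timings are the ones reported above for 100,000 and 1,000,000,000 steps.

```c
/* Seconds per additional step between two measurements of the same program. */
double marginal_cost(double t_small, double n_small, double t_large, double n_large)
{
    return (t_large - t_small) / (n_large - n_small);
}

/* Timings reported above for 100,000 and 1,000,000,000 steps. */
double phi_cost(void)     { return marginal_cost(0.272446, 1e5, 0.418968, 1e9); }
double pentium_cost(void) { return marginal_cost(0.002026, 1e5, 0.343882, 1e9); }
```

With these figures the marginal cost is about 1.5e-10 s per step on the Xeon Phi and about 3.4e-10 s per step on the Pentium Gold, so the coprocessor appears to process additional steps faster while paying a much larger fixed cost. This rests on two data points per device, so it is an indication rather than a benchmark.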

ArthurRatz · ‎04-05-2020

Munasinghe, Indula, my final advice is to set the number of threads to, say, NUM_THREADS = 500 with omp_set_num_threads(NUM_THREADS), run this code on the co-processor only, and measure the execution time. As far as I can tell, you ran this code on the co-processor with NUM_THREADS equal to 4.

Also, please check the Intel Xeon Phi documentation for special timing functions that obtain the wall-clock execution time of code running on the co-processor.

Thanks, Arthur.
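On the timing question, one thing that may be worth ruling out is start-up cost: the first parallel region in a program has to create its threads, and with 228 threads that could be a noticeable part of a sub-second measurement. This is an assumption about the cause, not something established in this thread. A minimal sketch of timing with a warm-up region first (the name timed_pi is made up, and the clock() fallback exists only so the code also builds without OpenMP):

```c
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* Fallback so the sketch also builds without OpenMP support. */
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

/* Returns the pi estimate and stores the loop's elapsed time in *elapsed.
   A throwaway parallel region runs first so that thread creation is not
   charged to the measurement. */
double timed_pi(long num_steps, double *elapsed)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    double t1, t2;
    int warm = 0;
    long i;

    /* Warm-up: every thread in the team adds 1, forcing the team to exist. */
    #pragma omp parallel reduction(+:warm)
    { warm += 1; }

    t1 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    t2 = omp_get_wtime();

    *elapsed = t2 - t1;
    return sum * step;
}
```

Comparing the elapsed time with and without the warm-up region, on the coprocessor, would show how much of the roughly 0.27 s is start-up rather than computation.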

Munasinghe__Indula · ‎04-05-2020

Hi Arthur,

Thank you for your answer.

NUM_THREADS is not an OpenMP directive. Since we have removed the omp_set_num_threads call from the code, the coprocessor can choose to execute with the maximum number of threads it has, right?

Actually, I copied the code into the reply above just after running it on the Pentium processor; I used 228 threads when I ran it on the coprocessor.

ArthurRatz · ‎04-05-2020

Hi Indula, thanks for your answer. I'd like to know whether anything has changed since the last code modification.

Since the last code modification had no effect on performance on the co-processor, I suspect a hardware-specific problem. Make sure that you re-install the latest Intel Xeon Phi drivers, available for download at https://downloadcenter.intel.com/product/75557/Intel-Xeon-Phi-Processors. Also, please re-install the specific libraries as discussed in https://software.intel.com/en-us/articles/xeon-phi-software. Finally, collect the information about the Intel Xeon Phi firmware and upgrade it if the installed version is not the latest.

At this point I have no further idea of what to advise. Allow me some time to read through the documentation, and I will get back to you if I find a solution.

Thanks. :)

Arthur.

ArthurRatz · ‎04-05-2020

Also, here are some examples of parallel code offloaded to the Intel Xeon Phi co-processor: https://www.eecs.umich.edu/courses/eecs570/hw/phi_intro.pdf

Specifically, try the following:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <omp.h>

static uint64_t num_steps = 10000000000;
double step;

int NUM_THREADS = 4;
int main()
{
	//#pragma offload target (mic:0)
	{
		double pi = 0.0, x, t1, t2, time = 0.0;
		uint64_t i;
		step = 1.0 / (double)num_steps;
		t1 = omp_get_wtime();
        
		//#pragma omp parallel for simd private(x) shared(step) reduction(+:pi)
		for (i = 0; i < num_steps; i++)
		{
			#pragma offload target (mic:0)
			{
				x = (i + 0.5) * step;
				pi += 4.0 / (1.0 + x * x) * step;
			}
		}
        
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n", pi);
		printf("time spent:(%f)\n", time);
	}

	return 0;
}

Give it a try and get back to me with the result, please.

Arthur.
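The switch to uint64_t in the listing above bears on the changed pi value reported earlier for 10,000,000,000 steps. With int i, the counter cannot reach that bound, and signed overflow is undefined behavior in C, so no particular result is guaranteed. The sketch below tests one hypothesis only: that the loop effectively ran num_steps modulo 2^32 iterations while step stayed at 1/num_steps. The function names are made up, and the truncation itself is an assumption about what the compiled code did.

```c
#include <stdint.h>

/* Midpoint-rule integral of 4/(1+x^2) over [0, upper], using n slices. */
static double integrate(double upper, long n)
{
    double h = upper / (double)n;
    double sum = 0.0;
    long k;
    for (k = 0; k < n; k++) {
        double x = (k + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * h;
}

/* Result of the pi loop if only (num_steps mod 2^32) iterations had run,
   which covers the interval [0, executed / num_steps] instead of [0, 1]. */
double pi_with_truncated_count(uint64_t num_steps)
{
    uint64_t executed = num_steps & 0xFFFFFFFFu;   /* low 32 bits of the trip count */
    double upper = (double)executed / (double)num_steps;
    return integrate(upper, 1000000L);
}
```

For num_steps = 10000000000 the low 32 bits are 1410065408, so the integral covers only [0, 0.1410065408], and the function returns about 0.560332. That matches the value printed on both the Pentium Gold and the Xeon Phi, which is consistent with the hypothesis and with a 64-bit counter being the fix.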