A Comparison - Parallel Programming on Xeon Phi vs Pentium Gold Processor

Munasinghe__Indula · ‎04-03-2020

Hello Everyone,

I have a CentOS 7.3 system with an Intel Xeon Phi 3120A coprocessor and an Intel Pentium Gold G5400 processor. I'm using the Intel compiler (ICC) to compile C code that uses OpenMP for parallel programming. I tested a simple program that calculates the value of Pi on this system. The Pentium processor at its maximum of 4 threads appears to be far faster than the coprocessor running all 228 threads. I know the Pentium cores are individually faster than the Xeon Phi cores, but given how many threads the Xeon Phi provides, I still can't understand the reason for this difference.

The code I used is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static long num_steps = 100000;
double step;

int NUM_THREADS=228;
void main()
{
	#pragma offload target (mic:0)
	{
		int i,nthreads; double pi,sum[NUM_THREADS],t1,t2,time = 0.0;
		step = 1.0/(double)num_steps;
		t1 = omp_get_wtime();

			omp_set_num_threads(NUM_THREADS);	
			#pragma omp parallel
			{
				double x;
				int i;
				int ID = omp_get_thread_num();
				int nthrds = omp_get_num_threads();
				if(ID==0) nthreads = nthrds;
				for(i=ID, sum[ID]=0.0; i<num_steps; i=i+nthrds)
				{
					x = (i+0.5)*step;
					sum[ID] += 4.0/(1.0+x*x);
				}
			}

		for(i=0,pi=0.0;i<nthreads;i++)pi += sum[i]*step;
		t2 = omp_get_wtime();
		time = t2 - t1; 
		printf("pi value:(%f)\n",pi);
		printf("time spent:(%f)\n",time);
	}
}
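
For scale, the cyclic split in the loop above (i = ID, ID + nthrds, ...) leaves each thread only a few hundred iterations when 228 threads share 100000 steps. The helper below is just a sketch to make that arithmetic explicit; iters_for_thread is a made-up name for illustration and is not part of the program above.

```c
/* Sketch: number of loop iterations one thread executes under the
 * cyclic distribution used above (i = id, id + nthreads, ... while i < steps).
 * iters_for_thread is a hypothetical helper name, not part of the original code. */
long iters_for_thread(long steps, int nthreads, int id)
{
    if (id >= steps)
        return 0;                      /* this thread has nothing to do */
    /* ceil((steps - id) / nthreads) using integer arithmetic */
    return (steps - id + nthreads - 1) / nthreads;
}
```

With num_steps = 100000, iters_for_thread(100000, 228, 0) gives 439 iterations on the coprocessor, while iters_for_thread(100000, 4, 0) gives 25000 on the Pentium.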

I ran the code on the Pentium Gold processor after removing the offload pragma and got the following result:

[root@localhost codes]# icc -qopenmp para_pi_mic.c -o para_pi_mic
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.001684)

Then on the coprocessor, I got the following result:

[root@localhost codes]# icc -qoffload -qopenmp para_pi_mic.c -o para_pi_mic
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.263586)

Here the time denotes the execution time for the parallelized code region.
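
Note that the interval between t1 and t2 also covers the start-up of the thread team and the final serial summation, not just the loop. One way to check how much of the measured time is one-off start-up cost would be to run the computation once as a warm-up and time only a second run. The following is a minimal host-side sketch under that assumption. The names wall_seconds, compute_pi and time_second_run are hypothetical, a reduction clause stands in for the per-thread sum array, and the offload pragma is left out.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

/* Timer: omp_get_wtime() when built with OpenMP, otherwise CPU time via clock(). */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Midpoint-rule estimate of pi, same integrand as the code above.
 * The reduction clause replaces the manual sum[ID] array. */
double compute_pi(long num_steps)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}

/* Runs the computation once untimed (warm-up), then times a second run.
 * Returns the elapsed seconds of the second run; the pi estimate goes to *pi_out. */
double time_second_run(long num_steps, double *pi_out)
{
    volatile double warm = compute_pi(num_steps);  /* warm-up, result discarded */
    (void)warm;

    double t1 = wall_seconds();
    *pi_out = compute_pi(num_steps);
    return wall_seconds() - t1;
}
```

Calling time_second_run(100000, &pi) from main and printing both values would give a number to set against the timings above; a much smaller second-run time would point to start-up overhead rather than the loop itself.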

Could you please explain what's happening here?