<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Arthur,  in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155872#M7890</link>
    <description>&lt;P&gt;Thank you for your reply!&lt;/P&gt;&lt;P&gt;I will test the code you have posted in your reply and let you know. The reply above is for a previous reply you made.&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sun, 05 Apr 2020 14:26:00 GMT</pubDate>
    <dc:creator>Munasinghe__Indula</dc:creator>
    <dc:date>2020-04-05T14:26:00Z</dc:date>
    <item>
      <title>Parallel Code: Why does Intel Pentium is faster than Intel Xeon Phi ?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155863#M7881</link>
      <description>&lt;P&gt;Hello Everyone,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a CentOS 7.3 system with an Intel Pentium Gold G5400 processor and an Intel Xeon Phi 3120A coprocessor. I'm using the Intel compiler (ICC) to compile C code that uses OpenMP for parallel programming. I tested a simple program that calculates the value of pi on this system. But the Pentium processor at its maximum of 4 threads seems to be far faster than the coprocessor at its full capacity of 228 threads. I know the Pentium cores are faster than the Xeon Phi cores, but given the number of threads the Xeon Phi can provide, I still can't understand the reason for this difference.&amp;nbsp;&lt;/P&gt;&lt;P&gt;The code I used is as follows:&amp;nbsp;&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;

static long num_steps = 100000;
double step;

int NUM_THREADS=228;
void main()
{
	#pragma offload target (mic:0)
	{
		int i,nthreads; double pi,sum[NUM_THREADS],t1,t2,time = 0.0;
		step = 1.0/(double)num_steps;
		t1 = omp_get_wtime();

			omp_set_num_threads(NUM_THREADS);	
			#pragma omp parallel
			{
				double x;
				int i;
				int ID = omp_get_thread_num();
				int nthrds = omp_get_num_threads();
				if(ID==0) nthreads = nthrds;
				for(i=ID, sum[ID]=0.0; i&amp;lt;num_steps; i=i+nthrds)
				{
                			x = (i+0.5)*step;
                			sum[ID] += 4.0/(1.0+x*x);
				}
			}

		for(i=0,pi=0.0;i&amp;lt;nthreads;i++)pi += sum[i]*step;
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n",pi);
		printf("time spent:(%f)\n",time);
	}
}&lt;/PRE&gt;

&lt;P&gt;I ran the code on the Pentium Gold processor after removing the offload directive and got the following result.&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;[root@localhost codes]# icc -qopenmp para_pi_mic.c -o para_pi_mic
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.001684)&lt;/PRE&gt;

&lt;P&gt;Then on the coprocessor, I got the following result,&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;[root@localhost codes]# icc -qoffload -qopenmp para_pi_mic.c -o para_pi_mic
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.263586)&lt;/PRE&gt;

&lt;P&gt;Here the time denotes the execution time for the parallelized code region.&lt;/P&gt;
&lt;P&gt;Could you explain what's happening here, please.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 08:44:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155863#M7881</guid>
      <dc:creator>Munasinghe__Indula</dc:creator>
      <dc:date>2020-04-05T08:44:07Z</dc:date>
    </item>
    <item>
      <title>Munasinghe, Indula, in your</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155864#M7882</link>
      <description>&lt;P&gt;Munasinghe, Indula, your example oversubscribes the coprocessor with a very large number of threads.&lt;/P&gt;&lt;P&gt;Please make sure that you do the following:&lt;/P&gt;&lt;P&gt;1. Remove the call to the omp_set_num_threads(NUM_THREADS) function.&lt;/P&gt;&lt;P&gt;Here is the complete code:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;

static long num_steps = 100000;
double step;

int NUM_THREADS=228;
void main()
{
	#pragma offload target (mic:0)
	{
		int i,nthreads; double pi,sum[NUM_THREADS],t1,t2,time = 0.0;
		step = 1.0/(double)num_steps;
		t1 = omp_get_wtime();

			//omp_set_num_threads(NUM_THREADS);	
			#pragma omp parallel
			{
				double x;
				int i;
				int ID = omp_get_thread_num();
				int nthrds = omp_get_num_threads();
				if(ID==0) nthreads = nthrds;
				for(i=ID, sum[ID]=0.0; i&amp;lt;num_steps; i=i+nthrds)
				{
                			x = (i+0.5)*step;
                			sum[ID] += 4.0/(1.0+x*x);
				}
			}

		for(i=0,pi=0.0;i&amp;lt;nthreads;i++)pi += sum[i]*step;
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n",pi);
		printf("time spent:(%f)\n",time);
	}
}&lt;/PRE&gt;

&lt;P&gt;2. Please build your code by using the commands below:&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;icc -qopenmp para_pi_mic.c -o para_pi_mic  # CPU

icc -qopenmp -qoffload-arch=mic-avx512 -o para_pi_mic para_pi_mic.c  # Intel Xeon Phi

&lt;/PRE&gt;

&lt;P&gt;That's all. Have a good day ahead!&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 09:53:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155864#M7882</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-05T09:53:00Z</dc:date>
    </item>
    <item>
      <title>Hi Arthur, </title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155865#M7883</link>
      <description>&lt;P&gt;Hi Arthur,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for answering my question!&amp;nbsp;&lt;/P&gt;&lt;P&gt;I still face some difficulties.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I removed the directives as you mentioned. When I tried to run the code on the Xeon Phi, it still ran on the Pentium Gold. The code runs on the Xeon Phi only when I have the&amp;nbsp; &lt;STRONG&gt;#pragma offload target (mic:0)&amp;nbsp;&lt;/STRONG&gt;directive.&lt;/P&gt;&lt;P&gt;Here's what happens when I try without the offload directive.&amp;nbsp;&lt;/P&gt;&lt;P&gt;On the Pentium Gold:&amp;nbsp;&lt;/P&gt;
&lt;PRE class="brush:bash; class-name:dark;"&gt;[root@localhost codes]# icc -qopenmp -o  para_pi_mic para_pi_mic.c
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.335569)&lt;/PRE&gt;

&lt;P&gt;When I try to execute it on the Xeon Phi, removing the offload directive and using avx512:&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;[root@localhost codes]# icc -qopenmp -qoffload-arch=mic-avx512 -o  para_pi_mic para_pi_mic.c
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.341854)&lt;/PRE&gt;

&lt;P&gt;It still runs on the Pentium Gold. I observed the CPU usage while the code ran for a large number of steps, and all the processing happened on the Pentium Gold.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then I kept the offload directive in the code and tried to use avx512 with it, and I got an error.&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;[root@localhost codes]# icc -qopenmp -qoffload-arch=mic-avx512 -o  para_pi_mic para_pi_mic.c
ld: warning: libcoi_device.so.0, needed by /opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5, not found (try using -rpath or -rpath-link)
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIPerfGetCycleFrequency@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIBufferAddRef@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIPipelineStartExecutingRunFunctions@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIEngineGetIndex@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIBufferReleaseRef@COI_1.0'
/opt/intel/compilers_and_libraries_2017.6.256/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIProcessWaitForShutdown@COI_1.0'
&lt;/PRE&gt;

&lt;P&gt;Could you tell me why this happens and what I need to do to get the modified code to run on the Xeon Phi?&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 12:03:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155865#M7883</guid>
      <dc:creator>Munasinghe__Indula</dc:creator>
      <dc:date>2020-04-05T12:03:58Z</dc:date>
    </item>
    <item>
      <title>Thanks for your reply and</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155866#M7884</link>
      <description>&lt;P&gt;Thanks for your reply and comment. I've got a couple of questions for you:&lt;/P&gt;&lt;P&gt;1. On what hardware platform did you initially run your code?&lt;/P&gt;&lt;P&gt;2. Is your code working on the Intel Xeon Phi, and have you received the warnings listed above?&lt;/P&gt;&lt;P&gt;As far as I've figured out, the Intel DevCloud does not support Intel Xeon Phi co-processors, in favor of Intel PAC / Intel ARRIA 10GX. So please don't run your code in the Intel DevCloud.&lt;/P&gt;&lt;P&gt;Also, remove the comment of the omp_set_num_threads(...)&amp;nbsp;and try to build and run the code regardless of whether the icc compiler gives the warning or not. Also run your code on-premises, locally on your own hardware.&lt;/P&gt;&lt;P&gt;As I've already explained, the only consideration is that you oversubscribe with too many threads. Just remove the omp_set_num_threads(NUM_THREADS)&amp;nbsp;function call.&lt;/P&gt;&lt;P&gt;Finally, when the code has been tested, please get back to me with your reply. I'd like to know the result.&lt;/P&gt;&lt;P&gt;Arthur.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 12:17:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155866#M7884</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-05T12:17:00Z</dc:date>
    </item>
    <item>
      <title>Hi Arthur, </title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155867#M7885</link>
      <description>&lt;P&gt;Hi Arthur,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I did not observe any performance change when I removed&amp;nbsp;&lt;STRONG&gt;omp_set_num_threads(NUM_THREADS);&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;And as I said in the previous reply, I can't use avx512 with&lt;STRONG&gt;&amp;nbsp;#pragma offload target (mic:0),&amp;nbsp;&lt;/STRONG&gt;because it gives the error shown in that reply.&amp;nbsp;&lt;/P&gt;&lt;P&gt;The Pentium Gold performance is still better than the Xeon Phi.&lt;/P&gt;&lt;P&gt;Pentium Gold&amp;nbsp;&lt;/P&gt;
&lt;PRE class="brush:bash; class-name:dark;"&gt;[root@localhost codes]# icc -qopenmp -o  para_pi_mic para_pi_mic.c
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.063168)&lt;/PRE&gt;

&lt;P&gt;Xeon Phi&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;[root@localhost codes]# icc -qopenmp -qoffload -o  para_pi_mic para_pi_mic.c
[root@localhost codes]# ./para_pi_mic
pi value:(3.141593)
time spent:(0.440111)&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 12:19:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155867#M7885</guid>
      <dc:creator>Munasinghe__Indula</dc:creator>
      <dc:date>2020-04-05T12:19:58Z</dc:date>
    </item>
    <item>
      <title>Please try this code:</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155868#M7886</link>
      <description>&lt;P&gt;Please try this code:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;

static long num_steps = 100000;
double step;

int NUM_THREADS = 228;
int main()
{
	#pragma offload target (mic:0)
	{
		double pi = 0.0, x, t1, t2, time = 0.0;
		step = 1.0 / (double)num_steps;
		t1 = omp_get_wtime();
        
		#pragma omp parallel for simd private(x) shared(step) reduction(+:pi)
		for (int i = 0; i &amp;lt; num_steps; i++)
		{
			x = (i + 0.5) * step;
			pi += 4.0 / (1.0 + x * x) * step;
		}
        
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n", pi);
		printf("time spent:(%f)\n", time);
	}

	return 0;
}&lt;/PRE&gt;

&lt;P&gt;and test its performance. Also please get back to me with your reply.&lt;/P&gt;
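&lt;P&gt;When testing, it may help to separate one-time OpenMP start-up cost from the loop work itself. The following is a minimal host-only sketch, not the offloaded code above: time_pi_region and now_seconds are hypothetical helper names, and the fallback to clock() (CPU time rather than wall time) is only there so that the file also builds without OpenMP.&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;time.h&amp;gt;
#ifdef _OPENMP
#include &amp;lt;omp.h&amp;gt;
#endif

/* Seconds from omp_get_wtime(); CPU time from clock() if built without OpenMP. */
static double now_seconds(void)
{
#ifdef _OPENMP
	return omp_get_wtime();
#else
	return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical helper: runs the midpoint-rule loop once, stores the
   estimate in *pi_out and returns the elapsed seconds. */
double time_pi_region(long n, double *pi_out)
{
	double step = 1.0 / (double)n, sum = 0.0;
	long i;
	double t1 = now_seconds();

	#pragma omp parallel for reduction(+:sum)
	for (i = 0; i &amp;lt; n; i++)
	{
		double x = (i + 0.5) * step;
		sum += 4.0 / (1.0 + x * x);
	}

	*pi_out = sum * step;
	return now_seconds() - t1;
}&lt;/PRE&gt;
&lt;P&gt;Calling time_pi_region twice in a row with the same n and printing both return values gives a rough comparison: in most OpenMP runtimes the first call also pays for creating the thread team, while the second reuses it.&lt;/P&gt;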
&lt;P&gt;Arthur.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 12:37:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155868#M7886</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-05T12:37:00Z</dc:date>
    </item>
    <item>
      <title>What I've exactly done is</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155869#M7887</link>
      <description>&lt;P&gt;What I've done is replace the OpenMP parallel workshare construct used in your code with tight loop parallelization. I'd like you to check&amp;nbsp;if there's a performance speed-up when running your code on the Intel Xeon Phi co-processor.&lt;/P&gt;&lt;P&gt;Arthur.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 12:43:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155869#M7887</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-05T12:43:12Z</dc:date>
    </item>
    <item>
      <title>Also I recommend you to</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155870#M7888</link>
      <description>&lt;P&gt;I also recommend that you increase the value of num_steps to 1000000 or 10000000 and then test the performance on both the Pentium and the Intel Xeon Phi.&amp;nbsp;&lt;/P&gt;&lt;P&gt;For a small number of steps (e.g. 100000), the thread scheduling overhead probably dominates.&lt;/P&gt;&lt;P&gt;Arthur.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 12:53:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155870#M7888</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-05T12:53:20Z</dc:date>
    </item>
    <item>
      <title>Hi Arthur, </title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155871#M7889</link>
      <description>&lt;P&gt;Hi Arthur,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for your reply!&lt;/P&gt;&lt;P&gt;1. The system I ran the code on has an Intel Pentium Gold G5400 processor and 4 GB of RAM, on an Asus Z370P motherboard. The coprocessor is an Intel Xeon Phi 3120A. It's a personal computer, so I run the code locally.&amp;nbsp;&lt;/P&gt;&lt;P&gt;2. Yes, the code runs well on the coprocessor. And I&amp;nbsp;observed through micsmc-gui that all the cores&amp;nbsp;were being used when the code runs on it.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, I can run the compiler with the warnings, but the binary is not generated when I use&amp;nbsp;&lt;STRONG&gt;-qoffload-arch=mic-avx512&amp;nbsp;&lt;/STRONG&gt;along with&amp;nbsp;&lt;STRONG&gt;#pragma offload target (mic:0)&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;When I remove the offload directive&amp;nbsp;&lt;STRONG&gt;#pragma offload target (mic:0)&lt;/STRONG&gt; from the code, the compiler doesn't give any warning with &lt;STRONG&gt;-qoffload-arch=mic-avx512&lt;/STRONG&gt;, and the binary is generated successfully. But the code runs only on the Pentium Gold processor. I printed the thread count to see which processor it runs on, and I get a thread count of 4.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 14:24:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155871#M7889</guid>
      <dc:creator>Munasinghe__Indula</dc:creator>
      <dc:date>2020-04-05T14:24:35Z</dc:date>
    </item>
    <item>
      <title>Hi Arthur, </title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155872#M7890</link>
      <description>&lt;P&gt;Thank you for your reply!&lt;/P&gt;&lt;P&gt;I will test the code you have posted in your reply and let you know. The reply above is for a previous reply you made.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 14:26:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155872#M7890</guid>
      <dc:creator>Munasinghe__Indula</dc:creator>
      <dc:date>2020-04-05T14:26:00Z</dc:date>
    </item>
    <item>
      <title>Hi Munasinghe, Indula, I'll</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155873#M7891</link>
      <description>&lt;P&gt;Hi Munasinghe, Indula, I'll be waiting for your reply, because I'm interested to know whether you've observed any progress in the performance speed-up on the Intel Xeon Phi.&lt;/P&gt;&lt;P&gt;Arthur.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 14:52:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155873#M7891</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-05T14:52:23Z</dc:date>
    </item>
    <item>
      <title>Quote:Munasinghe, Indula</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155874#M7892</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Munasinghe, Indula wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;2. Yes, the code runs well on the coprocessor. And I&amp;nbsp;observed through micsmc-gui that all the cores&amp;nbsp;were being used when the code runs on it.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;One more question: do you actually mean that the following code is running well on the co-processor?&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;

static long num_steps = 100000;
double step;

int NUM_THREADS = 228;
int main()
{
	#pragma offload target (mic:0)
	{
		double pi = 0.0, x, t1, t2, time = 0.0;
		step = 1.0 / (double)num_steps;
		t1 = omp_get_wtime();
        
		#pragma omp parallel for simd private(x) shared(step) reduction(+:pi)
		for (int i = 0; i &amp;lt; num_steps; i++)
		{
			x = (i + 0.5) * step;
			pi += 4.0 / (1.0 + x * x) * step;
		}
        
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n", pi);
		printf("time spent:(%f)\n", time);
	}

	return 0;
}&lt;/PRE&gt;

&lt;P&gt;Is this code running well on the co-processor?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 14:55:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155874#M7892</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-05T14:55:43Z</dc:date>
    </item>
    <item>
      <title>Also, I've test the code from</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155875#M7893</link>
      <description>&lt;P&gt;Also, I've tested the code from my answer on the CPU only. Here are some results:&lt;/P&gt;&lt;P&gt;num_steps = 100000&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;=&amp;gt; execution time =&amp;nbsp;0.035316&lt;/P&gt;&lt;P&gt;num_steps = 1000000000 =&amp;gt; execution time =&amp;nbsp;0.009555 (which is ~ 3.7x faster)&lt;/P&gt;&lt;P&gt;I expect to observe the same dynamic on the Intel Xeon Phi co-processors.&lt;/P&gt;&lt;P&gt;Please note that parallel execution of the code is of little use for a small number of iterations (e.g. num_steps in this particular case).&lt;/P&gt;&lt;P&gt;Thanks, Arthur.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 15:11:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155875#M7893</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-05T15:11:00Z</dc:date>
    </item>
    <item>
      <title>Hi Arthur, </title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155876#M7894</link>
      <description>&lt;P&gt;Hi Arthur,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I ran the code you gave me. It runs well on the coprocessor. The code follows. I made a small change to your code by moving&amp;nbsp;the &lt;STRONG&gt;int i &lt;/STRONG&gt;declaration outside the parallel section, since the compiler gave an error (undefined variable). It becomes a global variable shared by all threads because of that, right?&amp;nbsp;&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;

static long num_steps = 10000000000;
double step;

int NUM_THREADS = 4;
int main()
{
	//#pragma offload target (mic:0)
	{
		double pi = 0.0, x, t1, t2, time = 0.0;
		int i;
		step = 1.0 / (double)num_steps;
		t1 = omp_get_wtime();
        
		#pragma omp parallel for simd private(x) shared(step) reduction(+:pi)
		for (i = 0; i &amp;lt; num_steps; i++)
		{
			x = (i + 0.5) * step;
			pi += 4.0 / (1.0 + x * x) * step;
		}
        
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n", pi);
		printf("time spent:(%f)\n", time);
	}

	return 0;
}
&lt;/PRE&gt;
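&lt;P&gt;On the scope of i: as far as the OpenMP rules go, the iteration variable of the loop associated with a parallel for is made private to each thread even when it is declared outside the construct, so moving the declaration does not make it shared. The undefined-variable error may simply come from the compiler's default C dialect not accepting a declaration inside the for statement; building with -std=c99 would be one thing to try. A minimal host-only sketch (sum_indices is a hypothetical name, and it uses plain parallel for rather than parallel for simd) that would produce wrong totals if i were shared:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;/* Sums 0..n-1; i is declared outside the loop, as in the code above. */
long sum_indices(long n)
{
	long total = 0;
	long i; /* each thread gets its own copy inside the parallel for */

	#pragma omp parallel for reduction(+:total)
	for (i = 0; i &amp;lt; n; i++)
		total += i;

	return total;
}&lt;/PRE&gt;
&lt;P&gt;For n = 1000 the function should return 499500 regardless of the number of threads.&lt;/P&gt;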

&lt;P&gt;The following is the performance of the code on the Pentium and the Xeon Phi.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Xeon Phi Coprocessor&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;step_count = 100,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;[root@localhost codes]# icc -qopenmp -qoffload para_pi_mic_arthur.c -o para_pi_mic_arthur
[root@localhost codes]# ./para_pi_mic_arthur
pi value:(3.141593)
time spent:(0.272446)&lt;/PRE&gt;

&lt;P&gt;step_count = 1,000,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(3.141593)
time spent:(0.272476)&lt;/PRE&gt;

&lt;P&gt;step_count = 10,000,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(3.141593)
time spent:(0.273506)&lt;/PRE&gt;

&lt;P&gt;step_count = 100,000,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(3.141593)
time spent:(0.362132)&lt;/PRE&gt;

&lt;P&gt;step_count = 1,000,000,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(3.141593)
time spent:(0.418968)&lt;/PRE&gt;

&lt;P&gt;step_count = 10,000,000,000&amp;nbsp; &lt;STRONG&gt;(The answer for pi value changed)&lt;/STRONG&gt;&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(0.560332)
time spent:(0.448750)&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Pentium Gold&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;step_count = 100,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(3.141593)
time spent:(0.002026)&lt;/PRE&gt;

&lt;P&gt;step_count = 1,000,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(3.141593)
time spent:(0.003813)&lt;/PRE&gt;

&lt;P&gt;step_count = 10,000,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(3.141593)
time spent:(0.018445)&lt;/PRE&gt;

&lt;P&gt;step_count = 100,000,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(3.141593)
time spent:(0.050985)&lt;/PRE&gt;

&lt;P&gt;step_count = 1,000,000,000&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(3.141593)
time spent:(0.343882)&lt;/PRE&gt;

&lt;P&gt;step_count = 10,000,000,000&amp;nbsp; &lt;STRONG&gt;(The answer for pi value changed)&lt;/STRONG&gt;&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;pi value:(0.560332)
time spent:(0.478111)&lt;/PRE&gt;
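&lt;P&gt;The changed pi value at 10,000,000,000 steps is consistent with the 32-bit int loop counter: that step count does not fit in 32 bits, and its low 32 bits are 1,410,065,408. Summing only that many of the 10,000,000,000 slices covers x from 0 to about 0.141, where the integral of 4/(1+x*x) is roughly 0.56. The sketch below is an illustration under that assumption, not a confirmed diagnosis; low32 and midpoint_pi are hypothetical names, and the loop counter is 64-bit.&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;stdint.h&amp;gt;

/* Low 32 bits of a 64-bit step count: all that a 32-bit counter can hold. */
uint32_t low32(int64_t n)
{
	return (uint32_t)n;
}

/* Midpoint-rule sum over the first run of total slices, 64-bit counter. */
double midpoint_pi(int64_t total, int64_t run)
{
	double step = 1.0 / (double)total, sum = 0.0;
	int64_t i;

	#pragma omp parallel for reduction(+:sum)
	for (i = 0; i &amp;lt; run; i++)
	{
		double x = (i + 0.5) * step;
		sum += 4.0 / (1.0 + x * x);
	}

	return sum * step;
}&lt;/PRE&gt;
&lt;P&gt;low32(10000000000) gives 1410065408, and midpoint_pi(1000, 141) gives about 0.56, while midpoint_pi(100000, 100000) gives the usual 3.141593.&lt;/P&gt;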

&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So the Pentium processor is still much faster than the Xeon Phi, right?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 17:08:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155876#M7894</guid>
      <dc:creator>Munasinghe__Indula</dc:creator>
      <dc:date>2020-04-05T17:08:52Z</dc:date>
    </item>
    <item>
      <title>Munasinghe, Indula, finally</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155877#M7895</link>
      <description>&lt;P&gt;Munasinghe, Indula, my final advice is to set the number of threads to, say, NUM_THREADS = 500 with&amp;nbsp;omp_set_num_threads(NUM_THREADS), run this code on the co-processor only, and measure the execution time. As far as I can understand, you ran this code on the co-processor with the NUM_THREADS variable equal to 4.&lt;/P&gt;&lt;P&gt;Also, please check the Intel Xeon Phi documentation to see whether there are special timing functions for obtaining&amp;nbsp;the wall-clock execution time of code running on the co-processor.&lt;/P&gt;&lt;P&gt;Thanks, Arthur.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2020 18:51:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155877#M7895</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-05T18:51:11Z</dc:date>
    </item>
    <item>
      <title>Hi Arthur, </title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155878#M7896</link>
      <description>&lt;P&gt;Hi Arthur,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for your answer.&amp;nbsp;&lt;/P&gt;&lt;P&gt;NUM_THREADS is not an OpenMP directive. Since we have removed the omp_set_num_threads call from the code, the coprocessor can choose to execute with the maximum number of threads it has, right?&lt;/P&gt;&lt;P&gt;Actually, I copied the code into the reply above just after executing it on the Pentium processor. I used 228 threads when I executed it on the coprocessor.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 06 Apr 2020 06:05:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155878#M7896</guid>
      <dc:creator>Munasinghe__Indula</dc:creator>
      <dc:date>2020-04-06T06:05:37Z</dc:date>
    </item>
    <item>
      <title>Hi Indula, Thanks for your</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155879#M7897</link>
      <description>&lt;P&gt;Hi Indula, thanks for your answer. I'd like to know whether anything has changed since the last code modification.&lt;/P&gt;&lt;P&gt;Since the last code modification has had no effect on the performance on the co-processor, I suspect that it's a hardware-specific problem. Make sure that you re-install the latest Intel Xeon Phi drivers, available for download at&amp;nbsp;&lt;A href="https://downloadcenter.intel.com/product/75557/Intel-Xeon-Phi-Processors"&gt;https://downloadcenter.intel.com/product/75557/Intel-Xeon-Phi-Processors&lt;/A&gt;. Also, please re-install the specific libraries as discussed in&amp;nbsp;&lt;A href="https://software.intel.com/en-us/articles/xeon-phi-software"&gt;https://software.intel.com/en-us/articles/xeon-phi-software&lt;/A&gt;. Finally, collect the information about the Intel Xeon Phi firmware and upgrade it if the installed firmware is not the latest.&lt;/P&gt;&lt;P&gt;Actually, I've got no idea what else to advise so far. Allow me some time to read through the documentation, and I will get back to you if I find a solution.&lt;/P&gt;&lt;P&gt;Thanks. :)&lt;/P&gt;&lt;P&gt;Arthur.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Apr 2020 06:27:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155879#M7897</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-06T06:27:34Z</dc:date>
    </item>
    <item>
      <title>Also, here's some of the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155880#M7898</link>
      <description>&lt;P&gt;Also, here are some examples of code running in parallel, offloaded to the Intel Xeon Phi co-processor (&lt;A href="https://www.eecs.umich.edu/courses/eecs570/hw/phi_intro.pdf"&gt;https://www.eecs.umich.edu/courses/eecs570/hw/phi_intro.pdf&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;Specifically, make sure that you do the following:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;

static uint64_t num_steps = 10000000000;
double step;

int NUM_THREADS = 4;
int main()
{
	//#pragma offload target (mic:0)
	{
		double pi = 0.0, x, t1, t2, time = 0.0;
		uint64_t i;
		step = 1.0 / (double)num_steps;
		t1 = omp_get_wtime();
        
		//#pragma omp parallel for simd private(x) shared(step) reduction(+:pi)
		for (i = 0; i &amp;lt; num_steps; i++)
		{
			#pragma offload target (mic:0)
			{
				x = (i + 0.5) * step;
				pi += 4.0 / (1.0 + x * x) * step;
			}
		}
        
		t2 = omp_get_wtime();
		time = t2 - t1;
		printf("pi value:(%f)\n", pi);
		printf("time spent:(%f)\n", time);
	}

	return 0;
}&lt;/PRE&gt;

&lt;P&gt;Give it a try and get back to me with the result, please.&lt;/P&gt;
&lt;P&gt;Arthur.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Apr 2020 06:35:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Parallel-Code-Why-does-Intel-Pentium-is-faster-than-Intel-Xeon/m-p/1155880#M7898</guid>
      <dc:creator>ArthurRatz</dc:creator>
      <dc:date>2020-04-06T06:35:00Z</dc:date>
    </item>
  </channel>
</rss>

