<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Thank you for your response. in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049618#M49360</link>
    <description>&lt;P&gt;Thank you for your response. I tried what you suggested; here are the results:&lt;/P&gt;

&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;before&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;MIC+INTR ~ 5.18 sec&lt;BR /&gt;
	MIC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 4.75 sec&lt;/P&gt;

&lt;P&gt;CPU+INTR ~ 4.63 sec&lt;BR /&gt;
	CPU&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 6.47 sec&lt;/P&gt;

&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;after&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;MIC+INTR ~ 4.74 sec&lt;BR /&gt;
	MIC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 4.75 sec&lt;/P&gt;

&lt;P&gt;CPU+INTR ~ 4.31 sec&lt;BR /&gt;
	CPU&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 6.47 sec&lt;/P&gt;

&lt;P&gt;It's better, but it still doesn't outperform the auto-vectorized version on the MIC. Maybe something else is wrong in my code, or I forgot to specify some pragma directives?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;GS&lt;/P&gt;</description>
    <pubDate>Wed, 14 Jan 2015 10:44:04 GMT</pubDate>
    <dc:creator>Guillaume_S_</dc:creator>
    <dc:date>2015-01-14T10:44:04Z</dc:date>
    <item>
      <title>Intrinsic function on MIC (512)</title>
      <link>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049612#M49354</link>
      <description>&lt;P&gt;Hey everyone,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I'm working on a simple financial application (actually benchmarking CPU vs. MIC). The first version of the code uses no intrinsic functions (the compiler vectorizes the loops), and I wanted to try it with intrinsics. Here is my problem: on the CPU I observe a performance gain of about 30% with the m256 intrinsics (vs. the CPU without intrinsics), but on the MIC the m512 version (OpenMP + intrinsics) performs worse than the MIC version without intrinsics. Is this normal?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I cannot post the code because it is too big, but I can try to reproduce the problem in a simpler piece of code.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thank you&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;GS&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jan 2015 14:58:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049612#M49354</guid>
      <dc:creator>Guillaume_S_</dc:creator>
      <dc:date>2015-01-13T14:58:51Z</dc:date>
    </item>
    <item>
      <title>Among many possible</title>
      <link>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049613#M49355</link>
      <description>&lt;P&gt;Among the many possible explanations for your observations:&lt;/P&gt;

&lt;P&gt;Auto-vectorization in the absence of pragmas is necessarily more aggressive for the MIC target. It may vary with the host ISA or the optimization flags.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Intrinsics may inhibit the compiler's optimization of instruction scheduling, which is more important for MIC.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;If your code is large, the investigation may be time-consuming. You might compare the opt-report4 output for your hot spots.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jan 2015 16:03:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049613#M49355</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-01-13T16:03:36Z</dc:date>
    </item>
    <item>
      <title>Thank you for your response, I</title>
      <link>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049614#M49356</link>
      <description>&lt;P&gt;Thank you for your response. I'll try to look into the vectorization report.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I posted an example if you want to reproduce it.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jan 2015 17:49:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049614#M49356</guid>
      <dc:creator>Guillaume_S_</dc:creator>
      <dc:date>2015-01-13T17:49:44Z</dc:date>
    </item>
    <item>
      <title>Hello Guillaume,</title>
      <link>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049615#M49357</link>
      <description>&lt;P&gt;Hello Guillaume,&lt;/P&gt;

&lt;P&gt;Can you please share the link where you posted your example? I do not see any attachment on this thread. If you do not want to share your code publicly, you can send a private message so that we can investigate the issue further.&lt;/P&gt;

&lt;P&gt;Thanks !!&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jan 2015 17:54:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049615#M49357</guid>
      <dc:creator>Sunny_G_Intel</dc:creator>
      <dc:date>2015-01-13T17:54:59Z</dc:date>
    </item>
    <item>
      <title>To compile for MIC +</title>
      <link>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049616#M49358</link>
      <description>&lt;P&gt;To compile for MIC + intrinsic: icpc prog.cpp -O3 -openmp -DWITH_INTR&lt;BR /&gt;
	To compile for MIC: icpc prog.cpp -O3 -openmp&lt;/P&gt;

&lt;P&gt;To compile for CPU + intrinsic: icpc prog.cpp -O3 -openmp -DWITH_INTR -no-offload&lt;BR /&gt;
	To compile for CPU: icpc prog.cpp -O3 -openmp -no-offload&lt;/P&gt;

&lt;P&gt;MIC+INTR ~ 5.18 sec&lt;BR /&gt;
	MIC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 4.75 sec&lt;/P&gt;

&lt;P&gt;CPU+INTR ~ 4.63 sec&lt;BR /&gt;
	CPU&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 6.47 sec&lt;/P&gt;

&lt;P&gt;CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz | SandyBridge (2x8cores - 32 threads)&lt;BR /&gt;
	MIC: Intel(R) Xeon Phi(TM) coprocessor x100 family (61 cores - 244 threads)&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;
#include &amp;lt;offload.h&amp;gt;
#include &amp;lt;math.h&amp;gt;
#include &amp;lt;immintrin.h&amp;gt;

// Parenthesized so that expressions such as N-1 expand as intended.
#define N (2&amp;lt;&amp;lt;17)
#define P (2&amp;lt;&amp;lt;14)

__declspec(target(mic:0)) void testVctr( double *a, double *b, double *c )
{
	__assume_aligned( a, 64 );
	__assume_aligned( b, 64 );
	__assume_aligned( c, 64 );

	int i;
	int j;
	int k;

	#ifdef WITH_INTR
		#ifdef __MIC__
			__m512d  n1    = _mm512_set1_pd( 1. );
			__m512d  n1024 = _mm512_set1_pd( 1024. );
			__m512d  n230  = _mm512_set1_pd( 230. );
		#else
			__m256d n1    = _mm256_set1_pd( 1. );
			__m256d n1024 = _mm256_set1_pd( 1024. );
			__m256d n230  = _mm256_set1_pd( 230. );
		#endif
	#endif

	#pragma omp parallel for private( i, j, k ) schedule( dynamic )
	for( i=0; i&amp;lt;N; ++i )
	{
		#ifdef WITH_INTR
			#ifdef __MIC__	
				// Scratch buffer used to sum the 8 lanes of the accumulator.
				double *A = (double *) _mm_malloc( (size_t)( (8) * sizeof(double) ), 64 );

				__m512d res   = _mm512_setzero_pd(), r0, r1;

				for( j=0; j&amp;lt;P; j+=8 )
				{
					// r0 = exp( 1 / (b + 1) )
					r0 = _mm512_set_pd( b[j+7], b[j+6], b[j+5], b[j+4], b[j+3], b[j+2], b[j+1], b[j] );
					r0 = _mm512_add_pd( r0, n1 );
					r0 = _mm512_div_pd( n1, r0 );
					r0 = _mm512_exp_pd( r0 );
					
					// r1 = log( c * 1024 + 230 )
					r1 = _mm512_set_pd( c[j+7], c[j+6], c[j+5], c[j+4], c[j+3], c[j+2], c[j+1], c[j] );
					r1 = _mm512_mul_pd( r1, n1024 );
					r1 = _mm512_add_pd( r1, n230 );
					r1 = _mm512_log_pd( r1 );
				
					r0 = _mm512_div_pd( r0, r1 );

					res = _mm512_add_pd( res, r0 );
				}

				_mm512_store_pd( A, res );

				double tmp(0.);
				for( k=0; k&amp;lt;8; ++k )
					tmp += A[k];

				a[i] = tmp;

				_mm_free( (double *) A );

			#else
				// Scratch buffer used to sum the 4 lanes of the accumulator.
				double *A = (double *) _mm_malloc( (size_t)( (4) * sizeof(double) ), 64 );

				__m256d res   = _mm256_setzero_pd(), r0, r1;

				for( j=0; j&amp;lt;P; j+=4 )
				{
					// r0 = exp( 1 / (b + 1) )
					r0 = _mm256_set_pd( b[j+3], b[j+2], b[j+1], b[j] );
					r0 = _mm256_add_pd( r0, n1 );
					r0 = _mm256_div_pd( n1, r0 );
					r0 = _mm256_exp_pd( r0 );
					
					// r1 = log( c * 1024 + 230 )
					r1 = _mm256_set_pd( c[j+3], c[j+2], c[j+1], c[j] );
					r1 = _mm256_mul_pd( r1, n1024 );
					r1 = _mm256_add_pd( r1, n230 );
					r1 = _mm256_log_pd( r1 );
				
					r0 = _mm256_div_pd( r0, r1 );

					res = _mm256_add_pd( res, r0 );
				}

				_mm256_store_pd( A, res );

				double tmp(0.);
				for( k=0; k&amp;lt;4; ++k )
					tmp += A[k];

				a[i] = tmp;

				_mm_free( (double *) A );

			#endif
		#else
			// Scalar version; vectorization is left to the compiler.
			double res = 0.;

			for( j=0; j&amp;lt;P; ++j )
			{
				double tmp0 = 1./(b[j]+1.);
				double tmp1 = exp( tmp0 );
				double tmp2 = c[j] * 1024;
				double tmp3 = tmp2 + 230;
				double tmp4 = log( tmp3 );
				double tmp5 = tmp1 / tmp4;
				res += tmp5;
			}

			a[i] = res;
		#endif
	}
}

int main( void )
{
	int i;

	printf("\nOuter loop (N) %d iterations \nInner loop (P) %d iterations\n", N, P );

	double * a = (double *) _mm_malloc( (size_t)( (N) * sizeof(double) ), 64 );
	double * b = (double *) _mm_malloc( (size_t)( (P) * sizeof(double) ), 64 );
	double * c = (double *) _mm_malloc( (size_t)( (P) * sizeof(double) ), 64 ); 

	// Note: this is an integer division, so b[i] and c[i] are 0 unless rand() == RAND_MAX.
	for( i=0; i&amp;lt;P; ++i )
	{
		b[i] = rand()/RAND_MAX;
		c[i] = rand()/RAND_MAX;
	}
	#pragma offload target( mic : 0 ) \
	out( a : length( N ) align(512) ) \
	in ( b : length( P ) align(512) ) \
	in ( c : length( P ) align(512) )
	testVctr( a, b, c );		

	printf( "\nCheck last result: %f (~ 1.)\n", a[N-1]*2./(P) );

	_mm_free( (double *) a );
	_mm_free( (double *) b );
	_mm_free( (double *) c );
	
	return 0;
}&lt;/PRE&gt;</description>
      <pubDate>Tue, 13 Jan 2015 17:59:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049616#M49358</guid>
      <dc:creator>Guillaume_S_</dc:creator>
      <dc:date>2015-01-13T17:59:00Z</dc:date>
    </item>
    <item>
      <title>Hello,</title>
      <link>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049617#M49359</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;Have you considered replacing the lines&lt;/P&gt;
&lt;P&gt;r0 = _mm512_set_pd( b[j+7], b[j+6], b[j+5], b[j+4], b[j+3], b[j+2], b[j+1], b[j] );&lt;/P&gt;
&lt;P&gt;and&lt;/P&gt;
&lt;P&gt;r1 = _mm512_set_pd( c[j+7], c[j+6], c[j+5], c[j+4], c[j+3], c[j+2], c[j+1], c[j] );&lt;/P&gt;
&lt;P&gt;by something like&lt;/P&gt;
&lt;P&gt;r0 = _mm512_load_pd( &amp;amp;b[j] );&lt;/P&gt;
&lt;P&gt;and&lt;/P&gt;
&lt;P&gt;r1 = _mm512_load_pd( &amp;amp;c[j] );&lt;/P&gt;
&lt;P&gt;respectively? From glancing at the code, I assume this should be possible.&lt;/P&gt;
&lt;P&gt;Those set intrinsics can be very expensive (they can generate many instructions) compared with a single memory load of all the elements. My guess is that the compiler can figure out from the high-level code that all the arrays are aligned for the load, since all the declarations are in the same program scope.&lt;/P&gt;
&lt;P&gt;Best,&lt;/P&gt;
&lt;P&gt;Leo.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jan 2015 18:32:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049617#M49359</guid>
      <dc:creator>Leonardo_B_Intel</dc:creator>
      <dc:date>2015-01-13T18:32:22Z</dc:date>
    </item>
    <item>
      <title>Thank you for your response.</title>
      <link>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049618#M49360</link>
      <description>&lt;P&gt;Thank you for your response. I tried what you suggested; here are the results:&lt;/P&gt;

&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;before&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;MIC+INTR ~ 5.18 sec&lt;BR /&gt;
	MIC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 4.75 sec&lt;/P&gt;

&lt;P&gt;CPU+INTR ~ 4.63 sec&lt;BR /&gt;
	CPU&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 6.47 sec&lt;/P&gt;

&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;after&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;MIC+INTR ~ 4.74 sec&lt;BR /&gt;
	MIC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 4.75 sec&lt;/P&gt;

&lt;P&gt;CPU+INTR ~ 4.31 sec&lt;BR /&gt;
	CPU&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ~ 6.47 sec&lt;/P&gt;

&lt;P&gt;It's better, but it still doesn't outperform the auto-vectorized version on the MIC. Maybe something else is wrong in my code, or I forgot to specify some pragma directives?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;GS&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jan 2015 10:44:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intrinsic-function-on-MIC-512/m-p/1049618#M49360</guid>
      <dc:creator>Guillaume_S_</dc:creator>
      <dc:date>2015-01-14T10:44:04Z</dc:date>
    </item>
  </channel>
</rss>

