Starting Guide - Page 3

Swapnil_J_ · ‎12-12-2012

Hello Friend,

I am new to the Xe_studio Composer from Intel. I have good knowledge of parallel programming on GPUs with CUDA and OpenCL. I want to learn the Intel Composer XE tools: icc, MKL and IPP. I have read all the installation guides and tutorials, but can anyone suggest how I should start programming?

That is:

How do I use a single core and multiple cores of my processor?

How do I divide my execution across different cores?

Please help me! I am using an Intel i7 processor.

Thanks,

Bernard · ‎01-07-2013

Hi Swapnil! I'm posting preliminary results of one of my tests. In this test I compared code fully optimized by the Intel compiler with an unoptimized version of the same algorithm compiled by the Microsoft compiler. The code fills a 4096-element array with sine values and performs an FFT on those values. Here is the code:

    #include "stdafx.h"
    #include <windows.h>   // GetTickCount64, LONGLONG
    #include <stdio.h>     // printf
    #include <math.h>      // sin

    #define SWAP(a,b) temp=(a);(a)=(b);(b)=temp

    void fourier1(double data[], unsigned long nn, int isign);

    int _tmain(int argc, _TCHAR* argv[])
    {
        int i, q;
        LONGLONG start, end;
        const unsigned long MaxIter = 1e+6;
        double test[4096];

        // Fill the input array with sine values.
        for (i = 0; i < 4096; i++) test[i] = sin((double)i);

        // Note: fourier1 indexes data[1..2*nn] (1-based), so with nn = 2048
        // it touches test[4096], one element past the end of this array.
        start = GetTickCount64();
        for (q = 0; q < MaxIter; q++) {
            fourier1(test, 2048, 1);
        }
        end = GetTickCount64();

        // LONGLONG is 64 bits wide, so print it with %lld rather than %ld.
        printf("Intel compiler testcase start value is %lld \n", start);
        printf("Intel compiler testcase end value is %lld \n", end);
        printf("Intel compiler resulting overhead is %lld \n", (end - start));
        //for (i = 0; i < 2048; i++) printf("FFT test-case 1, fourier transform of sin() = %.17f\n", test[i]);
        return 0;
    }

    void fourier1(double data[], unsigned long nn, int isign)
    {
        unsigned long n, mmax, m, j, istep, i;
        double wtemp, wr, wpr, wpi, wi, theta, temp, tempi;

        // Bit-reversal reordering of the interleaved (real, imaginary) pairs.
        n = nn << 1;
        j = 1;
        for (i = 1; i < n; i += 2) {
            if (j < i) {
                SWAP(data[j], data[i]);
                SWAP(data[j+1], data[i+1]);
            }
            m = n >> 1;
            while (m >= 2 && j > m) {
                j -= m;
                m >>= 1;
            }
            j += m;
        }

        // Butterfly stages; the twiddle factors come from a trigonometric recurrence.
        mmax = 2;
        while (n > mmax) {
            istep = mmax << 1;
            theta = isign * (6.28318530717959 / mmax);
            wtemp = sin(0.5 * theta);
            wpr = -2.0 * wtemp * wtemp;
            wpi = sin(theta);
            wr = 1.0;
            wi = 0.0;
            for (m = 1; m < mmax; m += 2) {
                for (i = m; i <= n; i += istep) {
                    j = i + mmax;
                    temp = wr * data[j] - wi * data[j+1];
                    tempi = wr * data[j+1] + wi * data[j];
                    data[j] = data[i] - temp;
                    data[j+1] = data[i+1] - tempi;
                    data[i] += temp;
                    data[i+1] += tempi;
                }
                wr = (wtemp = wr) * wpr - wi * wpi + wr;
                wi = wi * wpr + wtemp * wpi + wi;
            }
            mmax = istep;
        }
    }

Bernard · ‎01-07-2013

Result for the Intel compiler test case (average of 3 consecutive runs):

Intel compiler testcase start value is 2915299
Intel compiler testcase end value is 3003331
Intel compiler resulting overhead is 88032 msec

So that is ~0.088032 milliseconds per loop iteration. Strangely, the Microsoft compiler test did not complete in 15000 milliseconds and I was forced to terminate it. Another run will be performed with 100k loop iterations.

Bernard · ‎01-07-2013

Fortunately, the Microsoft compiler test completed, with a whopping 162490 milliseconds per 10000 loop iterations, which means 16.249 msec per loop iteration. Here is the result:

Microsoft compiler FFT size 4096 testcase start value is 7108747
Microsoft compiler FFT size 4096 testcase end value is 7271237
Microsoft compiler FFT size 4096 testcase resulting overhead is 162490 msec

TimP · ‎01-08-2013

I'm mildly curious as to which Microsoft version you consider as "the" Microsoft version. MSVC in VS2012 is the first to make any use of simd instructions, but of course you must specify /arch:SSE2 (preferably AVX) if you wish this in the 32-bit version. Specification of unsigned long rather than int looks like an unnecessary handicap, as well as having differing meaning on non-Windows platforms.
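To illustrate the portability remark: `unsigned long` is 32 bits wide on Windows, even in 64-bit builds, while on typical 64-bit Linux and macOS toolchains it is 64 bits wide. The sketch below is only an illustration, assuming a C99 compiler; the helper names `ulong_bits` and `print_integer_widths` are made up for this example and do not come from the posted code.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Width of unsigned long, in bits, on the platform this is compiled for. */
unsigned ulong_bits(void)
{
    return (unsigned)(sizeof(unsigned long) * CHAR_BIT);
}

/* Print the widths of the integer types relevant to the loop counters. */
void print_integer_widths(void)
{
    /* The C standard only guarantees that unsigned long has at least 32 bits. */
    assert(ulong_bits() >= 32);
    printf("int:           %u bits\n", (unsigned)(sizeof(int) * CHAR_BIT));
    printf("unsigned long: %u bits\n", ulong_bits());
    printf("uint32_t:      %u bits\n", (unsigned)(sizeof(uint32_t) * CHAR_BIT));
}
```

Calling `print_integer_widths()` on each target shows whether the index variables in the FFT routine have the same width everywhere; a fixed-width type such as `uint32_t`, or plain `int`, removes the difference.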

Bernard · ‎01-08-2013

TimP (Intel) wrote:
I'm mildly curious as to which Microsoft version you consider as "the" Microsoft version. MSVC in VS2012 is the first to make any use of simd instructions, but of course you must specify /arch:SSE2 (preferably AVX) if you wish this in the 32-bit version.
Specification of unsigned long rather than int looks like an unnecessary handicap, as well as having differing meaning on non-Windows platforms.

Hi Tim. For my test I used Visual Studio 2010 and I chose to completely disable any optimization on the side of the VS 2010 C/C++ compiler. This was done solely for the sake of comparison between those two compilers. As I wrote in my previous post, I will soon create a thread where I will test both compilers. I was not aware that the VS 2010 compiler does not use SIMD vector instructions even when optimization settings are chosen.

Bernard · ‎01-08-2013

@Tim It is very strange that the VS 2010 compiler did not complete 1e6 loop iterations in a timely manner. In order to minimize function call overhead I used a trigonometric recurrence to calculate the sin and cos harmonics. Here are the results for 1e4 loop iterations with optimization enabled:

Array filling with 4096 sin function values
Starting 1e4 loop iterations
[Microsoft VS2010 compiler - optimization on FFT size] 4096 test 3 start value is 8773574 msec
[Microsoft VS2010 compiler - optimization on FFT size] 4096 test 3 end value is 8914271 msec
[Microsoft VS2010 compiler - optimization on FFT size] 4096 test 3 resulting overhead is 140697 msec

For comparison, the same code was compiled by the Intel C/C++ compiler, with the data array filled with 4096 sine values and 1e4 loop iterations.

Bernard · ‎01-08-2013

For comparison, here is the same algorithm operating on the same data set, with the loop iterated 1e4 times. The results:

Intel compiler testcase start value is 9510476 msec
Intel compiler testcase end value is 9511396 msec
Intel compiler resulting overhead is 920 msec
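The arithmetic behind these 1e4-iteration figures can be written out as a small sketch. The helper names `ms_per_iteration` and `slowdown` are hypothetical, and the input numbers are simply the totals reported in this thread.

```c
#include <assert.h>

/* Average time per loop iteration, in milliseconds. */
double ms_per_iteration(double total_ms, double iterations)
{
    assert(iterations > 0.0);
    return total_ms / iterations;
}

/* How many times longer the slower run took than the faster one. */
double slowdown(double slow_ms, double fast_ms)
{
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}
```

With the reported totals, `ms_per_iteration(920.0, 1e4)` gives 0.092 msec for the Intel build and `ms_per_iteration(140697.0, 1e4)` gives 14.0697 msec for the VS 2010 build, so `slowdown(140697.0, 920.0)` comes out at roughly 153.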

For anyone still interested in FFT testing, I got new, more accurate results. Instead of calling the fourier() routine 1e6 times and measuring the total execution time, I used the compiler intrinsic function __rdtsc() to measure the first for-loop block (responsible for dividing the data into odd and even parts) and the while-loop block (the main execution body) of the function. The results were more accurate than the earlier ones.

For the 4096-point FFT of the sine function, the execution time was ~212145 nanoseconds, i.e. about 212 microseconds.

Later I will continue evaluating the various functions being transformed and the time needed to accomplish that.
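A rough sketch of this kind of measurement is given below. It is not the code used for the numbers in this thread. It assumes an x86 compiler that provides `__rdtsc()` (`<intrin.h>` with MSVC, `<x86intrin.h>` with GCC, Clang and the Intel compiler) and falls back to `clock()` elsewhere. The names `read_ticks`, `ticks_for`, `ticks_to_ns` and `sample_work` are hypothetical, and the ticks-per-second value needed for the conversion to nanoseconds has to be supplied by the caller.

```c
#include <stdint.h>
#include <time.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define HAVE_RDTSC 1
#elif defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#define HAVE_RDTSC 1
#else
#define HAVE_RDTSC 0
#endif

/* Read a tick counter: the time-stamp counter on x86, clock() otherwise. */
uint64_t read_ticks(void)
{
#if HAVE_RDTSC
    return (uint64_t)__rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Ticks elapsed during a single call of work(arg). */
uint64_t ticks_for(void (*work)(void *), void *arg)
{
    uint64_t start = read_ticks();
    work(arg);
    return read_ticks() - start;
}

/* Convert ticks to nanoseconds; ticks_per_second is an assumed input,
   for example the nominal TSC frequency of the machine. */
double ticks_to_ns(uint64_t ticks, double ticks_per_second)
{
    return (double)ticks * 1e9 / ticks_per_second;
}

/* Stand-in workload for the block being timed: sums squares into *arg. */
void sample_work(void *arg)
{
    double *out = (double *)arg;
    double acc = 0.0;
    int i;
    for (i = 1; i <= 100000; i++)
        acc += (double)i * (double)i;
    *out = acc;
}
```

A call such as `ticks_for(sample_work, &result)` times one run of the workload; repeating the measurement several times and taking the minimum usually reduces noise from interrupts and frequency changes.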