Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Starting Guide

Swapnil_J_
Beginner
2,803 Views

Hello friends,

I am new to Intel Composer XE. I have good knowledge of parallel programming on GPUs with CUDA and OpenCL. I want to learn the Intel Composer XE tools: icc, MKL, and IPP. I have read all the installation guides and tutorials, but can anyone suggest how I should start programming?

That means:

How do I use a single core and multiple cores of my processor?

How do I divide my execution across different cores?

Please help me! I am using an Intel i7 processor.

Thanks,
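
As a starting point for the two questions above, here is a minimal sketch of how a loop can be divided across cores with OpenMP, which icc enables through the -openmp flag used later in this thread. The function name scale_array and its parameters are hypothetical and chosen only for illustration. If the OpenMP flag is omitted, the pragma is ignored and the same loop runs on a single core.

```c
/* Hypothetical helper: multiply every element of a[0..n-1] by factor.
 * With OpenMP enabled, the loop iterations are divided among the
 * available threads; without it, the loop runs serially on one core. */
void scale_array(double *a, int n, double factor)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] *= factor;
    }
}
```

Each iteration touches a different element, so no synchronization is needed; loops that accumulate into a shared variable need a reduction clause instead.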

0 Kudos
48 Replies
Swapnil_J_
Beginner
776 Views
Thanks, My Friend jimdempseyatthecove !!!
0 Kudos
Swapnil_J_
Beginner
776 Views
Hi friend, I have used the following function:

    double *A = (double *) malloc (SIZE*sizeof(double));
    result = cblas_dasum (vec_size, A, incX);

But how can I use the above function with the int data type, and how can I use it with void?

    int *A = (int *) malloc (SIZE*sizeof(int));
    result = cblas_dasum (vec_size, A, incX);
0 Kudos
mecej4
Honored Contributor III
776 Views
Swapnil, you cannot apply BLAS/CBLAS functions such as 'dasum' to arrays of integers, characters, etc. Even a function returning an integer, such as 'idamax', which returns the index of the largest (in absolute value) element of an array, takes an array argument that must be one of the real/complex types.

Jim, until they fix the forum software so that '<' and '>' do not get devoured in silence, please use '&lt;' and '&gt;', or substitute the double quote, ", for the '<' and '>' characters. In your code above, a number of header file names have been blanked out, and it is hard to guess all of them. Thanks.
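
Since the BLAS routines do not accept integer arrays, one option is a hand-written loop. The sketch below is not part of MKL or CBLAS; the name int_asum is hypothetical, and the argument order simply mirrors the cblas_dasum call shown above (count, array, stride). It assumes a positive stride. The other option would be to copy the integers into a double array and pass that to cblas_dasum.

```c
/* Hypothetical helper, not an MKL/CBLAS function: sum of the absolute
 * values of n integers taken from x with stride incx (assumed > 0).
 * A long long accumulator is used so that large arrays do not overflow
 * a 32-bit int. */
long long int_asum(int n, const int *x, int incx)
{
    long long sum = 0;
    int i;

    for (i = 0; i < n; i++) {
        long long v = x[(long long)i * incx];
        sum += (v < 0) ? -v : v;
    }
    return sum;
}
```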
0 Kudos
SergeyKostrov
Valued Contributor II
776 Views
I'd like to follow up...
>>... i have used 600,000,000 element and find sum of that number and it was giving result in 140 to 150 ms i want more >>optimzation...
Do you mean a faster calculation of the sum of the vector elements? I think that for a data set of 572MB, calculation times of 140 to 150 ms are not too bad; however, it is not clear what CPU you're using.
0 Kudos
Swapnil_J_
Beginner
776 Views
Hi Sergey, I am using an Intel i7 (4 cores, 8 threads) with 8 GB RAM and Ubuntu 12.04. Thanks,
0 Kudos
Bernard
Valued Contributor I
776 Views
It would be interesting if Swapnil could post the disassembled code of his vector or array summing function.
0 Kudos
SergeyKostrov
Valued Contributor II
776 Views
>>...I am using Intel I7 (4 core Processor) (8 Threads). and 8 GB RAM and Ubuntu 12.04 OS. I'll do a test on my Dell Precision Mobile ( Intel Core i7-3840QM / Ivy Bridge / 4 cores / 8 logical processors / 16GB ). I simply would like to verify how long it takes to get a sum for a 572MB vector of doubles. To be honest, your numbers 140 - 150ms are impressive ( looks too fast ).
0 Kudos
Bernard
Valued Contributor I
776 Views
>>>I'll do a test on my Dell Precision Mobile>>> Do you have a laptop or a desktop computer?
0 Kudos
Bernard
Valued Contributor I
776 Views
>>>To be honest, your numbers 140 - 150ms are impressive ( looks too fast ).>>> Consider the raw performance, measured in GFLOPS, of a single hyper-threaded core. When one logical core handles the integer work (for example the loop counter, with fused cmp/jmp and dec instructions) executed on Port 5, and the second logical core handles double-precision addition of four-component vectors using 256-bit AVX instructions, then, when not saturated, it is possible to achieve a theoretical throughput of 8 DP flops per cycle on Port 1, i.e. ~24 GFLOPS on a single core. Multiplied by four physical cores, you can reach almost 96 GFLOPS. Here is a good article: http://software.intel.com/en-us/forums/topic/291765
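
The arithmetic behind these figures can be sketched as a one-line formula: peak GFLOPS = flops per cycle x clock in GHz x number of cores. The function name peak_gflops is hypothetical, and the 3.0 GHz clock is an assumption inferred from the ~24 GFLOPS figure; the post does not state the clock speed.

```c
/* Hypothetical helper: theoretical peak throughput in GFLOPS.
 * flops_per_cycle: floating-point operations per cycle per core
 * clock_ghz:       core clock in GHz (3.0 is an assumption here)
 * cores:           number of physical cores */
double peak_gflops(double flops_per_cycle, double clock_ghz, int cores)
{
    return flops_per_cycle * clock_ghz * (double)cores;
}
```

With 8 flops per cycle at an assumed 3.0 GHz, peak_gflops(8.0, 3.0, 1) gives 24 and peak_gflops(8.0, 3.0, 4) gives 96. A plain summation loop is usually limited by memory bandwidth well before it reaches such a peak.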
0 Kudos
SergeyKostrov
Valued Contributor II
776 Views
>>>>I'll do a test on my Dell Precision Mobile... >> >>Do you have a laptop or desktop computer. This is a 15.6-inch laptop.
0 Kudos
SergeyKostrov
Valued Contributor II
776 Views
>>>>...I am using Intel I7 (4 core Processor) (8 Threads). and 8 GB RAM and Ubuntu 12.04 OS.
>>
>>I'll do a test on my Dell Precision Mobile ( Intel Core i7-3840QM / Ivy Bridge / 4 cores / 8 logical processors / 16GB ). I simply
>>would like to verify how long it takes to get a sum for a 572MB vector of doubles. To be honest, your numbers 140 - 150ms are
>>impressive ( looks too fast ).

I've just completed a set of tests and the numbers look right. Here are my results:

...
Succesfully Allocated 4.47GB
Initializing the array...
Done

[ Test 1 ] Calculating Sum of 600000000 elements ( Rolled Loops 1-in-1 )...
Sum of 600000000 elements calculated in: 0.718000 secs
Sum of 600000000 elements: 600000000.000000
Start: 4358917 ticks
End : 4359635 ticks

[ Test 2 ] Calculating Sum of 600000000 elements ( Unrolled Loops 4-in-1 )...
Sum of 600000000 elements calculated in: 0.327000 secs
Sum of 600000000 elements: 600000000.000000
Start: 4359635 ticks
End : 4359962 ticks
...

[ Note ] The test was executed ( forced ) on one CPU ( #3 ) with all C++ compiler optimizations turned off in a single-threaded 64-bit application ( without OpenMP ).
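
The allocation size reported above, and the memory throughput the timings imply, can be checked with a short calculation. This is only a sketch; the function names size_gib and bandwidth_gb_per_s are hypothetical. For 600000000 doubles of 8 bytes each, the size comes to about 4.47 GiB, and the 0.327 s unrolled run corresponds to roughly 14.7 GB/s of data read.

```c
#include <stddef.h>

/* Hypothetical helper: array size in GiB (2^30 bytes). */
double size_gib(size_t count, size_t elem_size)
{
    return (double)count * (double)elem_size / (1024.0 * 1024.0 * 1024.0);
}

/* Hypothetical helper: effective read throughput in GB/s (10^9 bytes
 * per second) for one pass over the array in the given time. */
double bandwidth_gb_per_s(size_t count, size_t elem_size, double seconds)
{
    return (double)count * (double)elem_size / seconds / 1e9;
}
```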
0 Kudos
SergeyKostrov
Valued Contributor II
776 Views
Here is the source code of the test:

...
int _ARRAY_SIZE = 600000000;
double dSum;
DWORD dwStart;
DWORD dwEnd;
double *pdData = NULL;

pdData = ( double * )malloc( ( _ARRAY_SIZE * sizeof( double ) ) );
if( pdData != NULL )
{
    _tprintf( _T("Succesfully Allocated %.2fGB\n"), ( ( double )( _ARRAY_SIZE * sizeof( double ) ) / 1024 / 1024 / 1024 ) );

    int i;

    // Fill the array with 1.0 so that the expected sum equals the element count
    _tprintf( _T("Initializing the array...\n") );
    for( i = 0; i < _ARRAY_SIZE; i++ )
    {
        pdData[i] = 1.0L;
    }
    _tprintf( _T("Done\n\n") );

    // Test 1: one element per iteration
    dSum = 0.0L;
    _tprintf( _T("[ Test 1 ] Calculating Sum of %d elements ( Rolled Loops 1-in-1 )...\n"), _ARRAY_SIZE );
    dwStart = ::GetTickCount();
    for( i = 0; i < _ARRAY_SIZE; i++ )
    {
        dSum += pdData[i];
    }
    dwEnd = ::GetTickCount();
    _tprintf( _T("Sum of %d elements calculated in: %f secs\n"), _ARRAY_SIZE, ( float )( dwEnd - dwStart ) / 1000.0f );
    _tprintf( _T("Sum of %d elements: %f\n"), _ARRAY_SIZE, dSum );
    _tprintf( _T("Start: %d ticks\nEnd : %d ticks\n\n"), dwStart, dwEnd );

    // Test 2: four elements per iteration ( array size must be a multiple of 4 )
    dSum = 0.0L;
    _tprintf( _T("[ Test 2 ] Calculating Sum of %d elements ( Unrolled Loops 4-in-1 )...\n"), _ARRAY_SIZE );
    dwStart = ::GetTickCount();
    for( i = 0; i < _ARRAY_SIZE; i+=4 )
    {
        dSum += ( pdData[i] + pdData[i+1] + pdData[i+2] + pdData[i+3] );
    }
    dwEnd = ::GetTickCount();
    _tprintf( _T("Sum of %d elements calculated in: %f secs\n"), _ARRAY_SIZE, ( float )( dwEnd - dwStart ) / 1000.0f );
    _tprintf( _T("Sum of %d elements: %f\n"), _ARRAY_SIZE, dSum );
    _tprintf( _T("Start: %d ticks\nEnd : %d ticks\n\n"), dwStart, dwEnd );
}
else
{
    _tprintf( _T("Failed to Allocate %.2fGB\n"), ( ( double )( _ARRAY_SIZE * sizeof( double ) ) / 1024 / 1024 / 1024 ) );
}
...
0 Kudos
SergeyKostrov
Valued Contributor II
776 Views
>>...i got max through put time 140 to 150 milliseconds for 600,000,000 (600 million number)
>>Can we reduce time for execution time?

Unroll your loops manually or with a pragma directive. It is simple and effective, and in my test above it improved performance by at least 2x. You could also combine unrolling with OpenMP, which should improve performance further.
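
A minimal sketch of combining the two suggestions: a 4-way unrolled loop inside an OpenMP parallel for with a reduction. This is a variation on the unrolled loop posted earlier in the thread, not the same code: it uses four independent accumulators and handles a length that is not a multiple of 4. The function name sum_unrolled is hypothetical. Because the additions are reordered, the floating-point result may differ in the last bits from a plain sequential sum.

```c
/* Hypothetical helper: sum data[0..n-1] with a 4-way unrolled loop.
 * With OpenMP enabled, the unrolled iterations are split across
 * threads and the four accumulators are combined by the reduction;
 * without it, the pragma is ignored and the loop runs serially. */
double sum_unrolled(const double *data, long n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long n4 = n - (n % 4);   /* largest multiple of 4 not exceeding n */
    long i;

    #pragma omp parallel for reduction(+:s0, s1, s2, s3)
    for (i = 0; i < n4; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }

    /* Remaining 0 to 3 elements, handled serially. */
    for (i = n4; i < n; i++) {
        s0 += data[i];
    }

    return (s0 + s1) + (s2 + s3);
}
```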
0 Kudos
Bernard
Valued Contributor I
776 Views
>>>I've just completed a set of tests and numbers look right. Here are my results:>>> Can you post the disassembled code of your test case? I'm mainly interested in the unrolled loop. Soon I will create a new thread where I plan to compare FFT algorithms compiled by the Intel and Microsoft compilers for speed of execution. I have already done a very quick comparison between the Intel compiler's fully optimized setting and the Microsoft compiler's unoptimized setting and, as expected, the Intel compiler was faster. This test was done on a 2048-element input array of sine function values. I plan to add various levels of complexity to my test cases and post the results.
0 Kudos
SergeyKostrov
Valued Contributor II
776 Views
>>Can you post disassembled code of your test case? Iliya, I posted the source code, right? You could compile it and study it in a debugger, etc., right?
0 Kudos
Bernard
Valued Contributor I
776 Views
>>>Iliya, I posted the source codes, right? You could compile it and study in a debugger, etc, right?>>> Yes, but in the next few days I will probably not have access to the Internet (I'm moving to a new town).
0 Kudos
Bernard
Valued Contributor I
776 Views
@Sergey Regarding my planned FFT-related compiler testing: do you have any ideas for interesting test cases? I mean the size of the data sets, what kind of functions to test, the use of random data, and so on.
0 Kudos
Swapnil_J_
Beginner
776 Views
Hello everyone, my internet was not working for the last 2 days; sorry for the late response to everyone's posts. I see that everyone is interested in the aggregate sum code with optimization. It is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>   /* gettimeofday, struct timeval */
#include <omp.h>

#define SIZE 2000000000

struct timeval starttime;

// start time function implementation
void startTimer()
{
    gettimeofday(&starttime, 0);
}

// end time function implementation: returns elapsed time in milliseconds
double endTimer()
{
    struct timeval endtime;
    gettimeofday(&endtime, 0);
    return (endtime.tv_sec - starttime.tv_sec)*1000.0 + (endtime.tv_usec - starttime.tv_usec)/1000.0;
}

int main ()
{
    long int sum = 0;
    int *A = (int *)malloc(sizeof(int)*SIZE);
    int i;

    if (A == NULL) {
        printf("Allocation failed\n");
        return 1;
    }

    // initialize every element to 1 so the expected sum equals SIZE
    for (i = 0; i < SIZE; i++) {
        A[i] = 1;
    }

    startTimer();
    // each thread sums its share of the array; the partial sums are combined into sum
    #pragma omp parallel for reduction (+:sum)
    for (i = 0; i < SIZE; i++) {
        sum += A[i];
    }

    printf("Result = %ld\n", sum);
    printf("Total Time Required = %lf ms\n", endTimer());
    free(A);
    return 0;
}

Please check it on your own systems and post your results together with your system configuration; I am interested to see how it works on different systems.

Shell script:

icc -O1 -openmp omp_reduction.c -o omp_reduction_O1
icc -O2 -openmp omp_reduction.c -o omp_reduction_O2
icc -O3 -openmp omp_reduction.c -o omp_reduction_03
icc -O -openmp omp_reduction.c -o omp_reduction_O
icc -Os -openmp omp_reduction.c -o omp_reduction_Os
icc -O0 -openmp omp_reduction.c -o omp_reduction_O0
icc -fast -openmp omp_reduction.c -o omp_reduction_fast
icc -Ofast -openmp omp_reduction.c -o omp_reduction_Ofast
icc -fno-alias -openmp omp_reduction.c -o omp_reduction_fno_alias
icc -fno-fnalias -openmp omp_reduction.c -o omp_reduction_fno_fnalias
0 Kudos
Bernard
Valued Contributor I
776 Views
>>>check on every ones system and post every one his own result with system configuration i am interested to see how it work on different system>>> It would be interesting to see the speed of execution when a floating-point array is summed. You can take a step of, for example, 0.00000001 and sum it. I bet that at least two ports will be used, Port 1 and Port 5. @Swapnil Do you want to participate in my compiler comparison test? I would like to measure the speed of execution achieved by the Intel and Microsoft compilers. I plan to run various test cases using the hard-to-optimize FFT algorithm.
0 Kudos
Swapnil_J_
Beginner
749 Views
@iliyapolak I would like to participate in your compiler comparison test.
0 Kudos
Bernard
Valued Contributor I
749 Views
Swapnil J. wrote:

@ iliyapolak
I like to participate in your compiler comparison test.

Thank you. Later today I will create a new thread solely for the purpose of the FFT testing.
0 Kudos