Hello Friend,
I am new to Intel Composer XE (part of Intel Studio XE). I have good knowledge of parallel programming on GPUs with CUDA and OpenCL. I want to learn the Intel Composer XE tools: icc, MKL and IPP. I have read all the installation guides and tutorials, but can anyone suggest how I should start programming?
That means:
How do I use a single core and multiple cores of my processor?
How do I divide my execution across different cores?
Please help me! I am using an Intel i7 processor.
Thanks,
48 Replies
Thanks, My Friend jimdempseyatthecove !!!
Hi Friend,
I have used the following function:
double *A = (double *) malloc (SIZE*sizeof(double));
result = cblas_dasum (vec_size, A, incX);
But how can I use the above function with the int data type, and how can I use it with void?
int *A = (int *) malloc (SIZE*sizeof(int));
result = cblas_dasum (vec_size, A, incX);
Swapnil,
You cannot apply BLAS/Cblas functions such as 'dasum' to arrays of integers, characters, etc. Even a function returning an integer, such as 'idamax', which returns the index of the largest (in absolute value) element of an array, takes an array argument that must be one of the real/complex types.
Jim,
Until they fix the forum software so that '<' and '>' do not get silently devoured, please use '&lt;' and '&gt;', or substitute the double quote (") for the '<' and '>' characters. In your code above, a number of header file names have been blanked out, and it is hard to guess all of them.
Thanks.
I'd like to follow up...
>>... i have used 600,000,000 element and find sum of that number and it was giving result in 140 to 150 ms i want more
>>optimization...
Do you mean a faster calculation of the sum of the vector elements?
I think that for a data set of 572MB, calculation times of 140 to 150 ms are not too bad; however, it is not clear what CPU you're using.
Hi Sergey,
I am using an Intel i7 processor (4 cores, 8 threads) with 8 GB RAM, running Ubuntu 12.04.
Thanks,
It would be interesting if Swapnil could post disassembled code of his vector or array summing function.
>>...I am using Intel I7 (4 core Processor) (8 Threads). and 8 GB RAM and Ubuntu 12.04 OS.
I'll do a test on my Dell Precision Mobile ( Intel Core i7-3840QM / Ivy Bridge / 4 cores / 8 logical processors / 16GB ). I simply would like to verify how long it takes to get a sum for a 572MB vector of doubles. To be honest, your numbers 140 - 150ms are impressive ( looks too fast ).
>>>I'll do a test on my Dell Precision Mobile>>>
Do you have a laptop or a desktop computer?
>>>To be honest, your numbers 140 - 150ms are impressive ( looks too fast ).>>>
Judging by the raw performance, measured in gflops, of a single hyperthreaded core: suppose one logical core handles the integer work (for example the loop counter, with fused cmp/jmp and dec instructions) executed on Port 5, while the second logical core handles double-precision additions of four-component vectors with 256-bit AVX instructions executed on Port 1. When the ports are not saturated, it is possible to approach the theoretical throughput of 8 DP flops per cycle, i.e. ~24 gflops on a single core. Multiplied by four physical cores, you can reach almost 96 gflops.
Here is a good article: http://software.intel.com/en-us/forums/topic/291765
>>>>I'll do a test on my Dell Precision Mobile...
>>
>>Do you have a laptop or desktop computer.
This is a 15.6-inch laptop.
>>>>...I am using Intel I7 (4 core Processor) (8 Threads). and 8 GB RAM and Ubuntu 12.04 OS.
>>
>>I'll do a test on my Dell Precision Mobile ( Intel Core i7-3840QM / Ivy Bridge / 4 cores / 8 logical processors / 16GB ). I simply
>>would like to verify how long it takes to get a sum for a 572MB vector of doubles. To be honest, your numbers 140 - 150ms are
>>impressive ( looks too fast ).
I've just completed a set of tests and numbers look right. Here are my results:
...
Succesfully Allocated 4.47GB
Initializing the array...
Done
[ Test 1 ] Calculating Sum of 600000000 elements ( Rolled Loops 1-in-1 )...
Sum of 600000000 elements calculated in: 0.718000 secs
Sum of 600000000 elements: 600000000.000000
Start: 4358917 ticks
End : 4359635 ticks
[ Test 2 ] Calculating Sum of 600000000 elements ( Unrolled Loops 4-in-1 )...
Sum of 600000000 elements calculated in: 0.327000 secs
Sum of 600000000 elements: 600000000.000000
Start: 4359635 ticks
End : 4359962 ticks
...
[ Note ]
The test was executed ( forced ) on one CPU ( #3 ) with all C++ compiler optimizations turned off, in a single-threaded 64-bit application ( without OpenMP ).
Here is the source code of the test:
...
int _ARRAY_SIZE = 600000000;
double dSum;
DWORD dwStart;
DWORD dwEnd;
double *pdData = NULL;
pdData = ( double * )malloc( ( _ARRAY_SIZE * sizeof( double ) ) );
if( pdData != NULL )
{
    _tprintf( _T("Succesfully Allocated %.2fGB\n"), ( ( double )( _ARRAY_SIZE * sizeof( double ) ) / 1024 / 1024 / 1024 ) );
    int i;
    _tprintf( _T("Initializing the array...\n") );
    // Set every element to 1.0 so that the expected sum equals the element count
    for( i = 0; i < _ARRAY_SIZE; i++ )
    {
        pdData[i] = 1.0;
    }
    _tprintf( _T("Done\n\n") );
    // Test 1: one element per loop iteration
    dSum = 0.0;
    _tprintf( _T("[ Test 1 ] Calculating Sum of %d elements ( Rolled Loops 1-in-1 )...\n"), _ARRAY_SIZE );
    dwStart = ::GetTickCount();
    for( i = 0; i < _ARRAY_SIZE; i++ )
    {
        dSum += pdData[i];
    }
    dwEnd = ::GetTickCount();
    _tprintf( _T("Sum of %d elements calculated in: %f secs\n"), _ARRAY_SIZE, ( float )( dwEnd - dwStart ) / 1000.0f );
    _tprintf( _T("Sum of %d elements: %f\n"), _ARRAY_SIZE, dSum );
    _tprintf( _T("Start: %d ticks\nEnd : %d ticks\n\n"), dwStart, dwEnd );
    // Test 2: four elements per loop iteration ( _ARRAY_SIZE is a multiple of 4 )
    dSum = 0.0;
    _tprintf( _T("[ Test 2 ] Calculating Sum of %d elements ( Unrolled Loops 4-in-1 )...\n"), _ARRAY_SIZE );
    dwStart = ::GetTickCount();
    for( i = 0; i < _ARRAY_SIZE; i+=4 )
    {
        dSum += ( pdData[i] + pdData[i+1] + pdData[i+2] + pdData[i+3] );
    }
    dwEnd = ::GetTickCount();
    _tprintf( _T("Sum of %d elements calculated in: %f secs\n"), _ARRAY_SIZE, ( float )( dwEnd - dwStart ) / 1000.0f );
    _tprintf( _T("Sum of %d elements: %f\n"), _ARRAY_SIZE, dSum );
    _tprintf( _T("Start: %d ticks\nEnd : %d ticks\n\n"), dwStart, dwEnd );
    free( pdData );
}
else
{
    _tprintf( _T("Failed to Allocate %.2fGB\n"), ( ( double )( _ARRAY_SIZE * sizeof( double ) ) / 1024 / 1024 / 1024 ) );
}
...
>>...i got max through put time 140 to 150 milliseconds for 600,000,000 (600 million number)
>>Can we reduce time for execution time?
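Unroll your loops manually or with a pragma directive. It is simple and effective: in my unoptimized single-threaded test, 4-in-1 unrolling cut the time from 0.718 to 0.327 seconds, an improvement of roughly 2x. You could also combine unrolling with OpenMP, which should improve performance further.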
>>>I've just completed a set of tests and numbers look right. Here are my results:>>>
Can you post the disassembled code of your test case?
I'm mainly interested in the unrolled loop.
Soon I will create a new thread in which I plan to compare FFT algorithms compiled by the Intel and Microsoft compilers for speed of execution.
I have already done a very quick comparison between the Intel compiler with full optimization and the Microsoft compiler with optimization turned off and, as expected, the Intel compiler was faster.
This test was done on an input data array of 2048 sine function elements.
I plan to add various levels of complexity to my test cases and post the results.
>>Can you post the disassembled code of your test case?
Iliya, I posted the source code, right? You could compile it and study it in a debugger, etc., right?
>>>Iliya, I posted the source codes, right? You could compile it and study in a debugger, etc, right?>>>
Yes, but in the next few days I will probably not have access to the Internet (I'm moving to a new town).
@Sergey
Regarding my planned FFT-related compiler testing: do you have any ideas for interesting test cases? I mean the size of the data sets, what kinds of functions to test, the use of random data, and so on.
Hello Everyone,
My internet was not working for the last 2 days; sorry for the late response to everyone's posts.
I have seen that everyone is interested in seeing the aggregate sum code with optimization.
It is as follows:
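#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>   // gettimeofday
#include <omp.h>
#define SIZE 2000000000
struct timeval starttime;
// start-timer function implementation
void startTimer()
{
    gettimeofday(&starttime,0);
}
// end-timer function implementation: returns the elapsed time in milliseconds
double endTimer()
{
    struct timeval endtime;
    gettimeofday(&endtime,0);
    return (endtime.tv_sec - starttime.tv_sec)*1000.0 + (endtime.tv_usec - starttime.tv_usec)/1000.0;
}
int main ()
{
    long int sum=0;
    double elapsed;
    int *A = (int *)malloc(sizeof(int)*SIZE);
    int i;
    if(A==NULL)
    {
        printf("Failed to allocate the array\n");
        return 1;
    }
    // fill the array with ones so that the expected sum equals SIZE
    for(i=0;i<SIZE;i++)
    {
        A[i]=1;
    }
    startTimer();
    // each thread sums its own part; the partial sums are combined by the reduction
    #pragma omp parallel for reduction (+:sum)
    for(i=0;i<SIZE;i++)
    {
        sum+=A[i];
    }
    elapsed=endTimer();   // stop the timer before printing
    printf("Result = %ld\n", sum);
    printf("Total Time Required = %lf ms\n",elapsed);
    free(A);
    return 0;
}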
Please run it on your own systems and post your results together with your system configuration; I am interested to see how it performs on different systems.
Shell script:
icc -O1 -openmp omp_reduction.c -o omp_reduction_O1
icc -O2 -openmp omp_reduction.c -o omp_reduction_O2
icc -O3 -openmp omp_reduction.c -o omp_reduction_O3
icc -O -openmp omp_reduction.c -o omp_reduction_O
icc -Os -openmp omp_reduction.c -o omp_reduction_Os
icc -O0 -openmp omp_reduction.c -o omp_reduction_O0
icc -fast -openmp omp_reduction.c -o omp_reduction_fast
icc -Ofast -openmp omp_reduction.c -o omp_reduction_Ofast
icc -fno-alias -openmp omp_reduction.c -o omp_reduction_fno_alias
icc -fno-fnalias -openmp omp_reduction.c -o omp_reduction_fno_fnalias
>>>check on every ones system and post every one his own result with system configuration i am interested to see how it work on different system>>>
It would be interesting to see the speed of execution when a floating-point array is summed. You could use a step of, for example, 0.00000001 and sum it. I bet that at least two ports will be used: Port 1 and Port 5.
@Swapnil
Do you want to participate in my compiler comparison test? I would like to measure the speed of execution achieved by the Intel and Microsoft compilers. I plan to run various test cases using a hard-to-optimize FFT algorithm.
@ iliyapolak
I would like to participate in your compiler comparison test.
>>@ iliyapolak
>>I like to participate in your compiler comparison test.
Thank you. Later today I will create a new thread solely for the purpose of the FFT testing.