Hello all,
I ran into a problem that looks very simple but seems hard to solve. Here is the code:
[cpp]
#include <stdio.h>
#include <omp.h>
#include <sys/time.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char** argv)
{
    int i;
    int nthreads;
    unsigned long int B = 1e9;
    unsigned int ai = 56372;
    double *array  = (double*)malloc(B*sizeof(double));
    double *origx  = (double*)malloc(B*sizeof(double));
    double *filter = (double*)malloc(B*sizeof(double));
    if (argc != 2) {
        printf("ERROR! The input is ./test <# of threads>\n");
        exit(-1);
    }
    nthreads = atoi(argv[1]);

    #pragma omp parallel for private(i) num_threads(nthreads)
    for (i = 0; i < B; i++)
    {
        array[i]  = 0;
        origx[i]  = i;
        filter[i] = i;
    }

    struct timeval start, end;
    gettimeofday(&start, NULL);
    double t1 = start.tv_sec + (start.tv_usec/1000000.0);

    #pragma omp parallel for private(i) num_threads(nthreads)
    for (i = 0; i < B; i++) {
        array[i] = origx[i]*filter[i];
    }

    gettimeofday(&end, NULL);
    double t2 = end.tv_sec + (end.tv_usec/1000000.0);
    printf("The original parallel wall time is %lf\n", t2-t1);

    free(array);
    free(origx);
    free(filter);
}
[/cpp]
But I found that the scalability is pretty poor when I compile it with the Intel compiler. I ran it on an Intel Sandy Bridge 8-core CPU, and the only compiler flag I used is -O3. The speedup is only about 4x on 8 cores and 2x on 4 cores...
Do you have any comments on how to improve the performance?
Thanks
Cheng
Hello Cheng,
1. First, you need to use some of the array elements after the loop; otherwise the compiler may delete the whole loop at -O3, since its results are never used (dead-code elimination).
2. There is not enough computation per memory access to get good scalability, given the threading and memory-bandwidth overheads. For the case:
[cpp]
#pragma omp parallel for private(i) num_threads(nthreads)
for (i = 0; i < B; i++) {
    array[i] = origx[i]*filter[i];
}
[/cpp]
I got numbers like these:
[bash]
-bash-4.1$ for i in {1..8};do ./a.out $i;done
Number of threads 1. The original parallel wall time is 2.003409, data is 999999998000000000.000000
Number of threads 2. The original parallel wall time is 0.964926, data is 999999998000000000.000000
Number of threads 3. The original parallel wall time is 0.737903, data is 999999998000000000.000000
Number of threads 4. The original parallel wall time is 0.554210, data is 999999998000000000.000000
Number of threads 5. The original parallel wall time is 0.507078, data is 999999998000000000.000000
Number of threads 6. The original parallel wall time is 0.424040, data is 999999998000000000.000000
Number of threads 7. The original parallel wall time is 0.451722, data is 999999998000000000.000000
Number of threads 8. The original parallel wall time is 0.507925, data is 999999998000000000.000000
[/bash]
But when I changed the multiplication to a division, which costs more cycles per calculation:
[cpp]
#pragma omp parallel for private(i) num_threads(nthreads)
for (i = 0; i < B; i++) {
    array[i] = origx[i]/filter[i];
}
[/cpp]
Numbers look more optimistic:
[bash]
-bash-4.1$ for i in {1..8};do ./a.out $i;done
Number of threads 1. The original parallel wall time is 4.281044, data is 1.000000
Number of threads 2. The original parallel wall time is 2.138184, data is 1.000000
Number of threads 3. The original parallel wall time is 1.431760, data is 1.000000
Number of threads 4. The original parallel wall time is 1.074740, data is 1.000000
Number of threads 5. The original parallel wall time is 0.859848, data is 1.000000
Number of threads 6. The original parallel wall time is 0.716466, data is 1.000000
Number of threads 7. The original parallel wall time is 0.620521, data is 1.000000
Number of threads 8. The original parallel wall time is 0.549075, data is 1.000000
[/bash]
Hope this helps,
--Vladimir
What Vladimir is illustrating is: more work per memory fetch/store == better scalability.
Once you reach the memory-bandwidth limit, adding more threads will not help. See if you can rework your code to get more work out of each memory fetch and store.
You might also want to get the most out of vectorization. This includes aligning the arrays to 64-byte boundaries, targeting an AVX-capable processor if appropriate, and, if your compiler's OpenMP support lacks simd capability, hand-blocking the code:
[cpp]
#pragma omp parallel for num_threads(nthreads)
for (int j = 0; j < B; j += 1024) {
    int iEnd = min(j+1024, B);
    #pragma (your pragma here to assure best AVX simd)
    for (int i = j; i < iEnd; ++i) {
        array[i] = origx[i]*filter[i];
    }
}
[/cpp]
