Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

scalability issue when compiling a simple OpenMP program

Cheng_W_
Beginner

Hello all,

I ran into a problem that looks very simple but seems hard to solve:

Here is the code:

#include <stdio.h>
#include <omp.h>
#include <sys/time.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char** argv)
{
int i;
int nthreads;
unsigned long int B = 1e9;
unsigned int ai = 56372;
double *array = (double*)malloc(B*sizeof(double));
double *origx = (double*)malloc(B*sizeof(double));
double *filter= (double*)malloc(B*sizeof(double));

if(argc !=2){
  printf("ERROR! The input is ./test <# of threads>\n");
  exit(-1);
}

nthreads = atoi(argv[1]);

#pragma omp parallel for private(i) num_threads(nthreads)
for(i=0; i<B; i++)
{
  array[i] = 0;
  origx[i] = i;
  filter[i] = i;
}

struct timeval start, end;
gettimeofday(&start, NULL);
double t1 = start.tv_sec + (start.tv_usec/1000000.0);

#pragma omp parallel for private(i) num_threads(nthreads)
for (i=0; i<B; i++){
  array[i] = origx[i]*filter[i];
}


gettimeofday(&end, NULL);
double t2 = end.tv_sec + (end.tv_usec/1000000.0);

printf("The orginal parallel wall time is %lf\n", t2-t1);

free(array);
free(origx);
free(filter);
}

But I found that the scalability with the Intel compiler is pretty poor. I ran it on an Intel Sandy Bridge 8-core CPU, and the only compiler flag I used is -O3. The speedup is only about 4x with 8 threads and about 2x with 4 threads...

Do you have some comments on how to improve the performance ?
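For reference, the build and run steps were roughly as follows (the OpenMP flag varies by compiler version, so treat this as a sketch):

[bash]
# Intel compiler; -openmp on older versions, -qopenmp on newer ones
icc -O3 -openmp test.c -o test
# run with different thread counts
for n in 1 2 4 8; do ./test $n; done
[/bash]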

Thanks

Cheng

3 Replies
SergeyKostrov
Valued Contributor II
>> unsigned long int B = 1e9;
>> ...
>> double *array = ( double * )malloc( B * sizeof(double) );
>> double *origx = ( double * )malloc( B * sizeof(double) );
>> double *filter= ( double * )malloc( B * sizeof(double) );

In total this allocates 22.35GB (!) of memory.

>> Do you have some comments on how to improve the performance?

There are caches (L3, L2, L1) and you should always take their sizes into account. In your case they are significantly smaller (a couple of MBs) than the amount of memory you have allocated (22.35GB). A different implementation with the loop-blocking optimization technique, or some "smart partitioning", should increase performance. However, when it comes to large data sets, additional physical memory and more CPUs are the only solution. Use VTune and it will provide you with additional information about cache usage, etc.
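A minimal sketch of the loop-blocking idea mentioned above (the block size of 4096 elements is an assumption; for a purely streaming kernel like this one, main-memory bandwidth remains the ultimate ceiling):

[cpp]
// process the arrays in fixed-size blocks; BLOCK is an assumed tile size
#define BLOCK 4096
#pragma omp parallel for num_threads(nthreads)
for (long j = 0; j < (long)B; j += BLOCK) {
  long jEnd = (j + BLOCK < (long)B) ? (j + BLOCK) : (long)B;
  for (long i = j; i < jEnd; i++)
    array[i] = origx[i]*filter[i];
}
[/cpp]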
Vladimir_P_1234567890

Hello Cheng,

1. First, you need to use some of the array elements after the loop; otherwise the computation may be deleted entirely by the compiler at -O3, since its results are never used (see the sketch after the timing output below).
2. There is not enough computation per memory access to get good scalability, because of the threading and memory overheads. For the case:

[cpp]

#pragma omp parallel for private(i) num_threads(nthreads)
for (i=0; i<B; i++){
  array[i] = origx[i]*filter[i];
}

[/cpp]

I've got numbers like these 

[bash]

-bash-4.1$ for i in {1..8};do ./a.out $i;done
Number of threads 1. The orginal parallel wall time is 2.003409, data is 999999998000000000.000000
Number of threads 2. The orginal parallel wall time is 0.964926, data is 999999998000000000.000000
Number of threads 3. The orginal parallel wall time is 0.737903, data is 999999998000000000.000000
Number of threads 4. The orginal parallel wall time is 0.554210, data is 999999998000000000.000000
Number of threads 5. The orginal parallel wall time is 0.507078, data is 999999998000000000.000000
Number of threads 6. The orginal parallel wall time is 0.424040, data is 999999998000000000.000000
Number of threads 7. The orginal parallel wall time is 0.451722, data is 999999998000000000.000000
Number of threads 8. The orginal parallel wall time is 0.507925, data is 999999998000000000.000000

 [/bash]
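The "data" column comes from using one element of the result after the timed loop, roughly like this (a sketch; the exact statement isn't shown above, but array[B-1] = (B-1)*(B-1) matches the printed value):

[cpp]
/* read the result after timing so -O3 cannot discard the loop as dead code */
printf("Number of threads %d. The orginal parallel wall time is %lf, data is %lf\n",
       nthreads, t2-t1, array[B-1]);
[/cpp]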

But when I changed the multiplication to a division, which means more cycles per calculation:

[cpp]

#pragma omp parallel for private(i) num_threads(nthreads)
for (i=0; i<B; i++){
  array[i] = origx[i]/filter[i];
}

[/cpp]

Numbers look more optimistic:

[bash]

-bash-4.1$ for i in {1..8};do ./a.out $i;done
Number of threads 1. The orginal parallel wall time is 4.281044, data is 1.000000
Number of threads 2. The orginal parallel wall time is 2.138184, data is 1.000000
Number of threads 3. The orginal parallel wall time is 1.431760, data is 1.000000
Number of threads 4. The orginal parallel wall time is 1.074740, data is 1.000000
Number of threads 5. The orginal parallel wall time is 0.859848, data is 1.000000
Number of threads 6. The orginal parallel wall time is 0.716466, data is 1.000000
Number of threads 7. The orginal parallel wall time is 0.620521, data is 1.000000
Number of threads 8. The orginal parallel wall time is 0.549075, data is 1.000000

[/bash]

Hope this helps,
--Vladimir

jimdempseyatthecove
Honored Contributor III

What Vladimir is illustrating is: more work per memory fetch/store == better scalability

When you reach a memory bandwidth limitation, adding more threads will not help. See if you can rework your code so that you get more work out of the memory fetches and stores.
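As a rough back-of-the-envelope check (the bandwidth figure is an assumption for a Sandy Bridge class socket): the timed loop streams three arrays of 1e9 doubles, i.e. about 3 * 8 GB = 24 GB per sweep, plus write-allocate traffic for array. At a sustained memory bandwidth on the order of 40-50 GB/s, that takes roughly half a second no matter how many cores are running, which is consistent with the times above flattening out at higher thread counts.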

You might also want to get the most out of vectorization. This includes aligning the arrays to 64-byte boundaries and, if appropriate, targeting an AVX-capable processor. If the compiler's OpenMP support lacks simd capability, then hand-block the code:

[cpp]
#pragma omp parallel for num_threads(nthreads)
for (int j=0; j < B; j += 1024){
  int iEnd = (j+1024 < B) ? j+1024 : B;
  // your pragma here to assure best AVX simd (e.g. #pragma simd)
  for (int i=j; i < iEnd; ++i) {
    array[i] = origx[i]*filter[i];
  }
}
[/cpp]
