Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Why is the code with OMP slower than without OMP?

bigknife
Beginner

Hi,

On my computer, a Dell PowerEdge 2900 (dual Xeon E5430 CPUs), code2 without OpenMP is about 3 times as fast as code1 with OpenMP. Why?

Thanks.

Peter



code1 (with OMP)
static vector<double> ompDblValues(8);

double MinOutputOMP(const vector<double> &outputs)
{
    ompDblValues.assign(8, FLT_MAX);
    const long n = (long)outputs.size(); // n = 220000
#pragma omp parallel for
    for (long i = 0; i < n; i++) {
        long nThread = omp_get_thread_num();
        ompDblValues[nThread] = (ompDblValues[nThread] < outputs[i]) ? ompDblValues[nThread] : outputs[i];
    }
    double minPositive = FLT_MAX;
    for (long i = 0; i < 8; i++) {
        minPositive = (minPositive < ompDblValues[i]) ? minPositive : ompDblValues[i];
    }
    return minPositive;
}

code2 (without OMP)
double MinOutput(const vector<double> &outputs)
{
    double minPositive = FLT_MAX;
    const long n = (long)outputs.size(); // n = 220000
    for (int i = 0; i < n; i++) {
        minPositive = (minPositive < outputs[i]) ? minPositive : outputs[i];
    }
    return minPositive;
}

7 Replies
Dmitry_Vyukov
Valued Contributor I
Quoting - bigknife

On my computer, a Dell PowerEdge 2900 (dual Xeon E5430 CPUs), code2 without OpenMP is about 3 times as fast as code1 with OpenMP. Why?

A possible reason is that the compiler has chosen a schedule with a granularity of a single for-loop iteration. To fix this, add a schedule clause with an explicit chunk size:

#pragma omp parallel for schedule(dynamic, 10000)

A second reason is false sharing in the ompDblValues array. If you want to do the reduction manually, you should use something like this:

size_t const cache_line_size = 128;

struct X
{
    double value;
    char pad[cache_line_size];
};

static vector<X> ompDblValues;
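
Putting the two suggestions together, here is a minimal sketch of what the manual reduction could look like with a chunked schedule and one cache line per thread slot. The struct and variable names are illustrative, and the fixed team size of 8 is simply carried over from the original code; it is not something the runtime guarantees:

[cpp]#include <vector>
#include <cfloat>
#include <omp.h>

size_t const cache_line_size = 128;

struct PaddedMin                      // one cache line per thread slot to avoid false sharing
{
    double value;
    char   pad[cache_line_size - sizeof(double)];
};

double MinOutputOMP(const std::vector<double> &outputs)
{
    std::vector<PaddedMin> mins(8);   // assumes a team of at most 8 threads, as in the original code
    for (int t = 0; t < 8; ++t)
        mins[t].value = FLT_MAX;

    const long n = (long)outputs.size();
#pragma omp parallel for schedule(dynamic, 10000)
    for (long i = 0; i < n; ++i) {
        int t = omp_get_thread_num();
        if (outputs[i] < mins[t].value)
            mins[t].value = outputs[i];
    }

    double minPositive = FLT_MAX;     // combine the per-thread slots serially
    for (int t = 0; t < 8; ++t)
        if (mins[t].value < minPositive)
            minPositive = mins[t].value;
    return minPositive;
}
[/cpp]

Once the false sharing is gone, a plain schedule(static) may work just as well for a loop this regular; the chunk size above is only a starting point to experiment with.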


Dmitry_Vyukov
Valued Contributor I
Quoting - bigknife

Code2 without OpenMP is about 3 times as fast as code1 with OpenMP. Why?

This is Ok ;)

TimP
Honored Contributor III
Quoting - bigknife

code2 (without OMP)

Published code which is successful at parallelizing such an operation gives each thread multiple batches of sufficient length, with the private results from individual batches combined in a critical region. This may not be the only way, but I suspect you will need to consider the OpenMP syntax I mentioned.

OpenMP Fortran includes a somewhat suitable reduction operator, but you shouldn't let the choice of C handicap you.

The following code finds the position of a maximum element in a float array, which is batched into groups of size aa_dim1, in a direct translation of a Fortran doubly subscripted array. Since C is in use, private is implicit in the definition of variables inside the parallel region. In your case, not saving the position should allow the inner loop to vectorize, and atomic may work in place of critical.

[cpp]      max__ = aa[aa_dim1 + 1];
      xindex = 1;
      yindex = 1;
      i__2 = *n;
      i__3 = *n;
#pragma omp parallel for if(i__2 > 103)
      for (j = 1; j <= i__2; ++j) {
          int indxj=0;
          float maxj=max__;
          for (int i__ = 1; i__ <= i__3; ++i__)
              if (aa[i__ + j * aa_dim1] > maxj){
                  maxj = aa[i__ + j * aa_dim1];
                  indxj = i__;
                  }
#pragma omp critical
            if(maxj > max__) {
                max__= maxj;
                xindex=indxj;
                yindex=j;
                }
        }
[/cpp]
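
Applied to the minimum problem from the original post, the batching pattern Tim describes (a private per-thread result, combined once per thread in a critical region) could look like the sketch below. The function name and the static schedule are assumptions for illustration, not part of Tim's code:

[cpp]#include <vector>
#include <cfloat>

double MinOutputCritical(const std::vector<double> &outputs)
{
    double minPositive = FLT_MAX;
    const long n = (long)outputs.size();
#pragma omp parallel
    {
        double localMin = FLT_MAX;            // private running minimum for this thread
#pragma omp for schedule(static) nowait
        for (long i = 0; i < n; ++i)
            if (outputs[i] < localMin)
                localMin = outputs[i];
#pragma omp critical
        {
            if (localMin < minPositive)       // combine the per-thread results
                minPositive = localMin;
        }
    }
    return minPositive;
}
[/cpp]

Because localMin lives on each thread's own stack, there is no false sharing, and the critical section is entered only once per thread rather than once per element.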

bigknife
Beginner
Quoting - Dmitriy Vyukov

I did add a schedule directive like "schedule(dynamic)" or "schedule(guided)", but it did not work. I have not tried "schedule(dynamic, 10000)" yet; I'll try it.

What's the meaning of "false sharing in the ompDblValues array"? I cannot figure it out.
Why should I use something like this:

size_t const cache_line_size = 128;

struct X
{
    double value;
    char pad[cache_line_size];
};

static vector<X> ompDblValues;

Thanks!

Peter

Dmitry_Vyukov
Valued Contributor I
Quoting - bigknife

Your post is unreadable, please repost it.

bigknife
Beginner
Quoting - Dmitriy Vyukov

Really? But I can read it very well.

Here is the repost:

I did add a schedule directive like "schedule(dynamic)" or "schedule(guided)", but it did not work. I have not tried "schedule(dynamic, 10000)" yet; I'll try it.

What's the meaning of "false sharing in the ompDblValues array"? I cannot figure it out.
Why should I use something like this:

size_t const cache_line_size = 128;

struct X
{
    double value;
    char pad[cache_line_size];
};

static vector<X> ompDblValues;

Thanks!

Peter


Dmitry_Vyukov
Valued Contributor I

When different cores/processors write data to memory locations that sit within one cache line, this imposes a huge performance overhead (hundreds of cycles).

When different cores/processors write to a single memory location, it is called [just] sharing.

When different cores/processors write to different memory locations that are still situated in one cache line, it is called false sharing.

Both things totally destroy performance and scalability on multi-core/multi-processor systems.
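
To see the effect Dmitriy describes, here is a small self-contained illustration (my own sketch, not from this thread): eight threads hammer adjacent 8-byte slots versus slots padded out to separate cache lines. The 128-byte pad and the num_threads(8) choice are assumptions to match the machine discussed above; volatile is used only so the compiler keeps every store to memory.

[cpp]#include <cstdio>
#include <omp.h>

const int cache_line_size = 128;

struct Unpadded { volatile double value; };                            // neighbouring slots share a cache line
struct Padded   { volatile double value; char pad[cache_line_size]; }; // each slot gets its own cache line

// Each of 8 threads repeatedly updates its own slot; the slot contents are
// summed afterwards so the work is not optimized away entirely.
template <typename Slot>
double hammer_slots(long iterations)
{
    Slot slots[8] = {};
#pragma omp parallel num_threads(8)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < iterations; ++i)
            slots[t].value = slots[t].value + 1.0;
    }
    double total = 0.0;
    for (int t = 0; t < 8; ++t)
        total += slots[t].value;
    return total;
}

int main()
{
    double t0 = omp_get_wtime();
    hammer_slots<Unpadded>(50000000L);
    double t1 = omp_get_wtime();
    hammer_slots<Padded>(50000000L);
    double t2 = omp_get_wtime();
    std::printf("unpadded: %.3f s   padded: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}
[/cpp]

On a multi-socket machine like the dual E5430 system, the padded variant should run noticeably faster; that difference is the overhead hidden inside the 8-element ompDblValues array in code1.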
