Hi,
On my computer, a Dell PowerEdge 2900 (dual Xeon E5430 CPUs), code2 without OpenMP is about three times as fast as code1 with OpenMP. Why?
Thanks.
Peter
code1 (with OpenMP)
static vector<double> ompDblValues;
double MinOutputOMP(const vector<double>& outputs)
{
    ompDblValues.assign(8, FLT_MAX);
    const long n = (long)outputs.size(); //n=220000
    #pragma omp parallel for
    for (long i=0; i<n; i++) {
        long nThread = omp_get_thread_num();
        ompDblValues[nThread] = (ompDblValues[nThread] < outputs[i]) ? ompDblValues[nThread] : outputs[i];
    }
    double minPositive = FLT_MAX;
    for (long i=0; i<8; i++) {
        minPositive = (minPositive < ompDblValues[i]) ? minPositive : ompDblValues[i];
    }
    return minPositive;
}
code2 (without OpenMP)
double MinOutput(const vector<double>& outputs)
{
    double minPositive = FLT_MAX;
    const long n = (long)outputs.size(); //n=220000
    for (int i=0; i<n; i++) {
        minPositive = (minPositive < outputs[i]) ? minPositive : outputs[i];
    }
    return minPositive;
}
One possible reason is that the compiler has chosen a schedule with the granularity of a single for-loop iteration. To fix this, add a schedule clause with an explicit chunk size:
#pragma omp parallel for schedule(dynamic, 10000)
The second reason is false sharing in the ompDblValues array. If you want to do the reduction manually, you should use something like this:
size_t const cache_line_size = 128;
struct X
{
    double value;
    char pad[cache_line_size];
};
static vector<X> ompDblValues;
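To make that concrete, here is a minimal sketch of what code1 might look like with both suggestions applied: coarse dynamic scheduling plus one cache-line-padded slot per thread. The chunk size of 10000 and the use of omp_get_max_threads() are assumptions for illustration, not values tuned on the poster's machine:
[cpp]
#include <vector>
#include <cstddef>
#include <cfloat>
#include <omp.h>
using std::vector;

size_t const cache_line_size = 128;
struct X
{
    double value;
    char pad[cache_line_size];   // keeps each per-thread slot on its own cache line
};

static vector<X> ompDblValues;

double MinOutputOMP(const vector<double>& outputs)
{
    const int nThreads = omp_get_max_threads();   // assumption: size slots by thread count
    X init = {};
    init.value = FLT_MAX;                         // FLT_MAX kept to match the original code
    ompDblValues.assign(nThreads, init);

    const long n = (long)outputs.size();          // n = 220000 in the original post
    // Coarse chunks: each thread grabs 10000 iterations at a time instead of
    // paying scheduling overhead on every single iteration.
    #pragma omp parallel for schedule(dynamic, 10000)
    for (long i = 0; i < n; i++) {
        const int t = omp_get_thread_num();
        if (outputs[i] < ompDblValues[t].value)
            ompDblValues[t].value = outputs[i];
    }

    // Serial reduction over the (few) per-thread slots.
    double minPositive = FLT_MAX;
    for (int t = 0; t < nThreads; t++)
        if (ompDblValues[t].value < minPositive)
            minPositive = ompDblValues[t].value;
    return minPositive;
}
[/cpp]
With each slot padded out to its own cache line, the per-thread writes no longer invalidate a line shared between cores.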
Published code that successfully parallelizes such an operation gives each thread multiple batches of sufficient length, with the private results from the individual batches combined in a critical region. This may not be the only way, but I suspect you will need to consider the OpenMP syntax I mentioned.
OpenMP Fortran includes a somewhat suitable reduction operator, but you shouldn't let the choice of C handicap you.
The following code finds the position of a maximum element in a float array, which is batched into groups of size aa_dim1, in a direct translation of a Fortran doubly subscripted array. Since C is in use, variables defined inside the parallel region are implicitly private. In your case, not saving the position should allow the inner loop to vectorize, and atomic may work in place of critical.
[cpp]
max__ = aa[aa_dim1 + 1];
xindex = 1;
yindex = 1;
i__2 = *n;
i__3 = *n;
#pragma omp parallel for if(i__2 > 103)
for (j = 1; j <= i__2; ++j) {
    int indxj = 0;
    float maxj = max__;
    for (int i__ = 1; i__ <= i__3; ++i__)
        if (aa[i__ + j * aa_dim1] > maxj) {
            maxj = aa[i__ + j * aa_dim1];
            indxj = i__;
        }
    #pragma omp critical
    if (maxj > max__) {
        max__ = maxj;
        xindex = indxj;
        yindex = j;
    }
}
[/cpp]
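Adapting that pattern to the original min-of-220000-doubles problem might look roughly like the sketch below. This is an adaptation for illustration, not code from this thread: each thread keeps its running minimum in a local variable declared inside the parallel region (so it is private and no shared array is written), and the partial results are merged once per thread in a critical section.
[cpp]
#include <vector>
#include <cfloat>
#include <omp.h>

// Sketch: private per-thread minimum, merged once at the end.
// No shared per-thread array means no false sharing in the hot loop.
double MinOutputOMP2(const std::vector<double>& outputs)
{
    const long n = (long)outputs.size();
    double minPositive = FLT_MAX;          // FLT_MAX kept to match the original code
    #pragma omp parallel
    {
        double localMin = FLT_MAX;         // private to this thread
        #pragma omp for nowait
        for (long i = 0; i < n; i++)
            if (outputs[i] < localMin)
                localMin = outputs[i];
        // One critical section per thread, not per iteration.
        #pragma omp critical
        {
            if (localMin < minPositive)
                minPositive = localMin;
        }
    }
    return minPositive;
}
[/cpp]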
I did add a schedule clause like "schedule(dynamic)" or "schedule(guided)", and it did not work. But I have not tried "schedule(dynamic, 10000)"; I'll try it.
What is the meaning of "false sharing in the ompDblValues array"? I cannot figure it out.
Why should I try something like this:
size_t const cache_line_size = 128;
struct X
{
    double value;
    char pad[cache_line_size];
};
static vector<X> ompDblValues;
Thanks!
Peter
Your post is unreadable, please repost it.
Really? But I can read it very well. Here it is again:
I did add a schedule clause like "schedule(dynamic)" or "schedule(guided)", and it did not work. But I have not tried "schedule(dynamic, 10000)"; I'll try it.
What is the meaning of "false sharing in the ompDblValues array"? I cannot figure it out.
Why should I try something like this:
size_t const cache_line_size = 128;
struct X
{
    double value;
    char pad[cache_line_size];
};
static vector<X> ompDblValues;
Thanks!
Peter
When different cores/processors write data to memory locations that sit in the same cache line, this imposes huge performance overheads (hundreds of cycles).
When different cores/processors write to a single memory location, it is called [just] sharing.
When different cores/processors write to different memory locations that still sit in the same cache line, it is called false sharing.
Both totally destroy performance and scalability on multi-core/multi-processor systems.
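As a back-of-the-envelope illustration (not from the thread) of why the original ompDblValues triggers this: eight adjacent doubles span only 8 * 8 = 64 bytes, so on a CPU with 64-byte cache lines all the per-thread slots typically land in a single line.
[cpp]
#include <cstdio>
#include <vector>

// Illustration only: the eight contiguous double slots of the original
// ompDblValues fit in one 64-byte cache line, so every write by one core
// invalidates that line for all the other cores.
int main()
{
    std::vector<double> ompDblValues(8);
    std::printf("total span of the 8 slots: %zu bytes\n",
                sizeof(double) * ompDblValues.size());            // 64 bytes
    std::printf("distance between adjacent slots: %td bytes\n",   // 8 bytes, same line
                (char*)&ompDblValues[1] - (char*)&ompDblValues[0]);
    return 0;
}
[/cpp]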
