maemarcus
Beginner
85 Views

OpenMP no speedup

Hello, I've got several huge loops structured as follows:

for (unsigned int k = 1; k < ns_1; k++)
{
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *_C(UPtr) = quat_dtDivdx * (
                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr)); UPtr++;
            uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
}

Here _C( ) and _R( ) are macros for the numerical stencil, i.e. the central point and the right-hand point. The Ptrs are sliders that move over one-dimensional arrays. In short, a very common loop for a computational algorithm.

Say I'd like to add OpenMP support here. I do the following:

#ifdef _OPENMP
#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    // Thread-localize data sliders.
    double
        *loc_UPtr = UPtr + k * nx_1 * ny_2,
        *loc_uPtr = uPtr + k * np,
        *loc_u_Ptr = u_Ptr + k * np;
    // Redefine data sliders.
#define UPtr loc_UPtr
#define uPtr loc_uPtr
#define u_Ptr loc_u_Ptr
#else
for (unsigned int k = 1; k < ns_1; k++)
{
#endif
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *_C(UPtr) = quat_dtDivdx * (
                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr)); UPtr++;
            uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
#ifdef _OPENMP
    // Restore the original slider names.
#undef UPtr
#undef uPtr
#undef u_Ptr
#endif
}

This simple idea came from looking at basic OpenMP examples:

1) put the pragma on the outer loop

2) for every slider, create an independent thread-local copy using preprocessor definitions

OK, now let me ask the question: why does the threading extension described above bring absolutely NO benefit on a dual-core machine? I mean, there is no speedup; the timings (I use the clock() function) are almost equal. However, in Task Manager I can see that with _OPENMP both cores are kept busy by the program's process. What is the reason?

Thanks.

5 Replies
TimP
Black Belt

clock() measures the total time used by all threads, so it would be an excellent result if that time doesn't increase with threading. OpenMP provides the function omp_get_wtime() for measuring elapsed time; on Intel-compatible platforms, __rdtsc() may also be useful.
From your description, I can't tell whether you have dependency problems, where one thread uses data that is updated by another.
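The difference between the two kinds of measurement can be sketched as follows. This is only an illustration: the names Timing and time_both are hypothetical, and std::chrono::steady_clock stands in for omp_get_wtime() so that the sketch also compiles without OpenMP enabled.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical helper type: one measurement taken with both clocks.
struct Timing {
    double cpu_seconds;   // processor time reported by clock()
    double wall_seconds;  // elapsed time from a monotonic clock
};

// Runs `work` `repeats` times and reports both measurements.
// With OpenMP enabled, omp_get_wtime() could replace steady_clock here.
template <typename Work>
Timing time_both(Work work, int repeats)
{
    assert(repeats > 0);
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        work();
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();

    Timing t;
    t.cpu_seconds = static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(w1 - w0).count();
    return t;
}
```

Passing the loop nest as `work` with a larger `repeats` lengthens the measured interval, which may help when a single pass takes only a few milliseconds and the timer is coarse.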
maemarcus
Beginner

Hello, Tim,

Thanks for the reply.

> From your description, I'm not certain if you have possible dependency problems, where one thread uses data which are updated by the other.

I believe my threads are not data-dependent. The general formula is U = F(u, u_), where u and u_ are read-only and U does not depend on itself. To make it clear, let me provide the preprocessed source:

double
    *UPtr = this->get_TopLevel()->get_Values(),
    *u_Ptr = uFlow->levels[uFlow->levelsCount - 1]->get_Values() +
        nx + np,
    *uPtr = uFlow->get_TopLevel()->get_Values() + nx + np;

#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    double
        *loc_UPtr = UPtr + k * nx_1 * ny_2,
        *loc_uPtr = uPtr + k * np,
        *loc_u_Ptr = u_Ptr + k * np;
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *((loc_UPtr)) = quat_dtDivdx * (
                *((loc_u_Ptr)) + *((loc_u_Ptr + 1)) + *((loc_uPtr)) + *((loc_uPtr + 1)));
            if (abs(*((loc_UPtr))) > 1e0) throw *((loc_UPtr));; loc_UPtr++;
            loc_uPtr++; loc_u_Ptr++;
        }
        loc_uPtr++; loc_u_Ptr++;
    }
    loc_uPtr += nx2; loc_u_Ptr += nx2;
}
}

So here, in the parallel version, I'm trying to give each k-iteration independent copies of the sliders (names starting with loc_) with the corresponding offsets.
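One way to check that the per-iteration offsets agree with the serial traversal is to replay the increments the serial loop applies to an input slider. The sketch below does that under stated assumptions: both function names are hypothetical, and the parameters stand for the loop trip counts and the end-of-plane skip (nx2 in the code above), whose actual values are not given in the thread.

```cpp
#include <cstddef>
#include <vector>

// Replays the increments applied to an input slider (uPtr or u_Ptr)
// and records its offset at the start of every k iteration.
std::vector<std::ptrdiff_t> serial_input_offsets(int k_iters, int j_iters,
                                                 int i_iters,
                                                 std::ptrdiff_t tail_skip)
{
    std::vector<std::ptrdiff_t> starts;
    std::ptrdiff_t offset = 0;
    for (int k = 0; k < k_iters; ++k) {
        starts.push_back(offset);
        for (int j = 0; j < j_iters; ++j) {
            offset += i_iters;  // one ++ per inner iteration
            offset += 1;        // the extra ++ after each row
        }
        offset += tail_skip;    // the "+= nx2" after each plane
    }
    return starts;
}

// True when k * stride reproduces every serial start offset,
// i.e. when "base + k * stride" is a valid thread-local start.
bool matches_stride(const std::vector<std::ptrdiff_t>& starts,
                    std::ptrdiff_t stride)
{
    for (std::size_t k = 0; k < starts.size(); ++k)
        if (starts[k] != static_cast<std::ptrdiff_t>(k) * stride)
            return false;
    return true;
}
```

If matches_stride returns false for the stride used in the parallel version (np above), the threads would read different elements than the serial loop does.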

Now, about timing. When I enclose the loop above in clock() calls, the result varies from 0.0149 to 0.016 sec, the same for the serial and parallel versions. If I replace clock() with omp_get_wtime(), the result varies from 0.0156 to 0.018 sec for serial and from 0.0110 to 0.118 sec for parallel. These timings differ a little from test to test; in any case, according to omp_get_wtime(), the parallel version seems to be about 30% faster than the serial one.
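The "about 30% faster" figure can be reproduced from the lower ends of the two omp_get_wtime() ranges quoted above (0.0156 sec serial, 0.0110 sec parallel). The helper names below are hypothetical, and the thread count of 2 is an assumption taken from the dual-core machine mentioned earlier.

```cpp
#include <cassert>

// Speedup: how many times faster the parallel run is.
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

// Fraction of the serial time that the parallel run saves.
double time_saved_fraction(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0);
    return 1.0 - t_parallel / t_serial;
}

// Parallel efficiency: speedup divided by the number of threads.
double efficiency(double t_serial, double t_parallel, int threads)
{
    assert(threads > 0);
    return speedup(t_serial, t_parallel) / threads;
}
```

For 0.0156 and 0.0110 this gives a speedup of roughly 1.42, a time saving of roughly 29%, and an efficiency of roughly 0.71 on two threads.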

maemarcus
Beginner

I've heard that with OpenMP it is worse to parallelize code that uses pointers (as I do) rather than arrays indexed with []. Is that true?
TimP
Black Belt

Pointers wouldn't necessarily be a problem if you made them and the loop indices private.
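A minimal sketch of that advice, assuming a simplified layout rather than the poster's actual one: U holds nz planes of ny rows of nx values, while u and u_ hold nz planes of ny rows of nx + 1 values. The function name stencil_private and all parameters are hypothetical. Pointers and inner indices are declared inside the parallel loop body, which makes them private to each thread. The range check counts violations through a reduction instead of throwing, since OpenMP requires an exception thrown inside a parallel region to be caught within that region by the same thread.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the number of output values whose magnitude exceeds 1.
int stencil_private(std::vector<double>& U,
                    const std::vector<double>& u,
                    const std::vector<double>& u_,
                    int nx, int ny, int nz, double q)
{
    const std::ptrdiff_t in_plane = static_cast<std::ptrdiff_t>(nx + 1) * ny;
    const std::ptrdiff_t out_plane = static_cast<std::ptrdiff_t>(nx) * ny;
    double* const U0 = U.data();
    const double* const u0 = u.data();
    const double* const u_0 = u_.data();
    int violations = 0;

    #pragma omp parallel for reduction(+:violations)
    for (int k = 0; k < nz; ++k) {
        // Declared inside the loop body, so each thread has its own copies.
        double* Up = U0 + k * out_plane;
        const double* up = u0 + k * in_plane;
        const double* u_p = u_0 + k * in_plane;
        for (int j = 0; j < ny; ++j) {
            for (int i = 0; i < nx; ++i) {
                // Central and right-hand points of both inputs.
                *Up = q * (u_p[0] + u_p[1] + up[0] + up[1]);
                if (std::fabs(*Up) > 1.0)
                    ++violations;
                ++Up; ++up; ++u_p;
            }
            ++up; ++u_p;  // skip the extra right-hand column
        }
    }
    return violations;
}
```

Compiled without OpenMP, the pragma is ignored and the function runs serially with the same result, which makes the two variants easy to compare.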
maemarcus
Beginner

OK, thanks!