Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OpenMP no speedup

maemarcus
Beginner

Hello, I've got several huge loops fashioned as follows:

for (unsigned int k = 1; k < ns_1; k++)
{
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *_C(UPtr) = quat_dtDivdx * (
                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr)); UPtr++;
            uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
}

Here _C() and _R() are macros implementing numerical stencil patterns, i.e. the central point and the right point. The *Ptr variables are sliders that move over one-dimensional arrays. So, a fairly typical loop for a computational algorithm.
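For reference, a sketch of what these macros amount to, reconstructed from the preprocessed listing that appears later in this thread:

// Stencil access: central point and right neighbor.
#define _C(p) ((p))
#define _R(p) ((p + 1))
// Throws the offending value if it exceeds the 1e0 bound.
#define THROW_COURANT(x) if (abs(x) > 1e0) throw x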

Say I'd like to add OpenMP support here. I do the following:

#ifdef _OPENMP
#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    // Thread-localize the data sliders.
    double
        *loc_UPtr  = UPtr + k * nx_1 * ny_2,
        *loc_uPtr  = uPtr + k * np,
        *loc_u_Ptr = u_Ptr + k * np;
// Redefine the data sliders to their thread-local copies.
#define UPtr  loc_UPtr
#define uPtr  loc_uPtr
#define u_Ptr loc_u_Ptr
#else
for (unsigned int k = 1; k < ns_1; k++)
{
#endif
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *_C(UPtr) = quat_dtDivdx * (
                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr)); UPtr++;
            uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
#ifdef _OPENMP
// Restore the original slider names.
#undef UPtr
#undef uPtr
#undef u_Ptr
#endif
}

This simple idea came from looking at basic OpenMP examples:

1) put the pragma on the outer loop;

2) for every slider, create an independent thread-local copy via the preprocessor definitions.

OK, now please let me ask the question: why does the threading extension described above bring absolutely NO benefit on a dual-core machine? I mean, there is no speedup; the timings (I use the clock() function from <time.h>) are almost equal. However, in the Task Manager I can see that with _OPENMP both cores are kept busy by the program's process. What is the reason?

Thanks.

TimP
Honored Contributor III
clock() measures the total CPU time used by all threads, so you would actually have an excellent result if that time doesn't increase with threading. OpenMP provides the function omp_get_wtime() for measuring elapsed (wall-clock) time; on Intel-compatible platforms, __rdtsc() may also be useful.
From your description, I'm not certain whether you have possible dependency problems, where one thread uses data which are updated by the other.
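A minimal sketch of the difference, assuming POSIX semantics for clock() (where it accumulates CPU time across all threads); the loop body is just placeholder work:

#include <omp.h>
#include <ctime>
#include <cstdio>

int main()
{
    clock_t c0 = clock();          // CPU time, summed over all threads
    double  w0 = omp_get_wtime();  // wall-clock (elapsed) time

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int k = 1; k < 100000000; k++)
        sum += 1.0 / k;

    double cpu  = double(clock() - c0) / CLOCKS_PER_SEC;
    double wall = omp_get_wtime() - w0;

    // With good scaling, cpu stays roughly constant while wall drops
    // as threads are added; clock() alone can hide the speedup.
    printf("sum = %f, cpu = %.3f s, wall = %.3f s\n", sum, cpu, wall);
    return 0;
}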
maemarcus
Beginner

Hello, Tim,

Thanks for reply,

> From your description, I'm not certain whether you have possible dependency problems, where one thread uses data which are updated by the other.

I suppose my threads are not data-dependent. The general formula is U = F(u, u_); here u and u_ are read-only, and U is not self-dependent. To make it clear, let me provide the preprocessed source:

double
    *UPtr  = this->get_TopLevel()->get_Values(),
    *u_Ptr = uFlow->levels[uFlow->levelsCount - 1]->get_Values() +
             nx + np,
    *uPtr  = uFlow->get_TopLevel()->get_Values() + nx + np;

#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    double
        *loc_UPtr  = UPtr + k * nx_1 * ny_2,
        *loc_uPtr  = uPtr + k * np,
        *loc_u_Ptr = u_Ptr + k * np;
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *((loc_UPtr)) = quat_dtDivdx * (
                *((loc_u_Ptr)) + *((loc_u_Ptr + 1)) +
                *((loc_uPtr)) + *((loc_uPtr + 1)));
            if (abs(*((loc_UPtr))) > 1e0) throw *((loc_UPtr)); loc_UPtr++;
            loc_uPtr++; loc_u_Ptr++;
        }
        loc_uPtr++; loc_u_Ptr++;
    }
    loc_uPtr += nx2; loc_u_Ptr += nx2;
}

So here, in the parallel version, I'm trying to provide each k-iteration with independent slider copies (names starting with loc_) and the corresponding offsets.

Now, about timing. When enclosing the cycle above in clock() calls, the result varies from 0.0149 to 0.016 sec, the same for the serial and parallel versions. If I change the clock() calls to omp_get_wtime(), the result varies from 0.0156 to 0.018 sec for serial and from 0.0110 to 0.0118 sec for parallel. These timings differ a little from test to test; anyway, by omp_get_wtime() the parallel version seems to be about 30% faster than the serial one.

maemarcus
Beginner
I've heard that with OpenMP it is worse to parallelize code that works through pointers (as mine does) rather than through arrays with the [] indexing operator. Is that true?
TimP
Honored Contributor III
Pointers wouldn't necessarily be a problem if you made them and the loop indices private.
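A minimal sketch of that idea; the array names u, U and the sizes n, np are hypothetical stand-ins, not the original data structures:

// Pointers and indices declared inside the parallel loop body are
// automatically private to each thread, so no two threads share a slider.
void stencil_pass(const double *u, double *U, int n, int np, double c)
{
    #pragma omp parallel for
    for (int k = 0; k < n; k++)
    {
        const double *src = u + k * np;   // thread-private slider
        double       *dst = U + k * np;   // thread-private slider
        for (int i = 0; i + 1 < np; i++)
            dst[i] = c * (src[i] + src[i + 1]);
    }
}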
maemarcus
Beginner
OK, thanks!