Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OpenMP no speedup

maemarcus
Beginner

Hello, I've got several huge loops fashioned as follows:

for (unsigned int k = 1; k < ns_1; k++)
{
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *_C(UPtr) = quat_dtDivdx * (
                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr)); UPtr++;
            uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
}

Here _C() and _R() are macros implementing numerical stencil patterns, i.e. the central point and the right point. The *Ptr variables are sliders that move over one-dimensional arrays. So, a fairly typical loop for a computational algorithm.
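For reference, a sketch of what these macros amount to, reconstructed from the preprocessed listing that appears later in this thread:

// Stencil access: central point and right neighbor.
#define _C(p) ((p))
#define _R(p) ((p + 1))
// Throws the offending value if it exceeds the 1e0 bound.
#define THROW_COURANT(x) if (abs(x) > 1e0) throw x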

Say I'd like to add OpenMP support here. I do the following:

#ifdef _OPENMP
#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    // Thread-localize the data sliders.
    double
        *loc_UPtr  = UPtr + k * nx_1 * ny_2,
        *loc_uPtr  = uPtr + k * np,
        *loc_u_Ptr = u_Ptr + k * np;
// Redefine the data sliders to their thread-local copies.
#define UPtr  loc_UPtr
#define uPtr  loc_uPtr
#define u_Ptr loc_u_Ptr
#else
for (unsigned int k = 1; k < ns_1; k++)
{
#endif
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *_C(UPtr) = quat_dtDivdx * (
                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr)); UPtr++;
            uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
#ifdef _OPENMP
// Restore the original slider names.
#undef UPtr
#undef uPtr
#undef u_Ptr
#endif
}

This simple idea came from looking at basic OpenMP examples:

1) put the pragma on the outer loop;

2) for every slider, create an independent thread-local copy via the preprocessor definitions.

OK, now please let me ask the question: why does the threading extension described above bring absolutely NO benefit on a dual-core machine? I mean, there is no speedup; the timings (I use the clock() function from <time.h>) are almost equal. However, in the Task Manager I can see that with _OPENMP both cores are kept busy by the program's process. What is the reason?

Thanks.

TimP
Honored Contributor III
clock() measures the total CPU time used by all threads, so you would actually have an excellent result if that time doesn't increase with threading. OpenMP provides the function omp_get_wtime() for measuring elapsed (wall-clock) time; on Intel-compatible platforms, __rdtsc() may also be useful.
From your description, I'm not certain whether you have possible dependency problems, where one thread uses data which are updated by the other.
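A minimal sketch of the difference, assuming POSIX semantics for clock() (where it accumulates CPU time across all threads); the loop body is just placeholder work:

#include <omp.h>
#include <ctime>
#include <cstdio>

int main()
{
    clock_t c0 = clock();          // CPU time, summed over all threads
    double  w0 = omp_get_wtime();  // wall-clock (elapsed) time

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int k = 1; k < 100000000; k++)
        sum += 1.0 / k;

    double cpu  = double(clock() - c0) / CLOCKS_PER_SEC;
    double wall = omp_get_wtime() - w0;

    // With good scaling, cpu stays roughly constant while wall drops
    // as threads are added; clock() alone can hide the speedup.
    printf("sum = %f, cpu = %.3f s, wall = %.3f s\n", sum, cpu, wall);
    return 0;
}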
maemarcus
Beginner

Hello, Tim,

Thanks for reply,

> From your description, I'm not certain whether you have possible dependency problems, where one thread uses data which are updated by the other.

I suppose my threads are not data-dependent. The general formula is U = F(u, u_); here u and u_ are read-only, and U is not self-dependent. To make it clear, let me provide the preprocessed source:

double
    *UPtr  = this->get_TopLevel()->get_Values(),
    *u_Ptr = uFlow->levels[uFlow->levelsCount - 1]->get_Values() +
             nx + np,
    *uPtr  = uFlow->get_TopLevel()->get_Values() + nx + np;

#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    double
        *loc_UPtr  = UPtr + k * nx_1 * ny_2,
        *loc_uPtr  = uPtr + k * np,
        *loc_u_Ptr = u_Ptr + k * np;
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *((loc_UPtr)) = quat_dtDivdx * (
                *((loc_u_Ptr)) + *((loc_u_Ptr + 1)) +
                *((loc_uPtr)) + *((loc_uPtr + 1)));
            if (abs(*((loc_UPtr))) > 1e0) throw *((loc_UPtr)); loc_UPtr++;
            loc_uPtr++; loc_u_Ptr++;
        }
        loc_uPtr++; loc_u_Ptr++;
    }
    loc_uPtr += nx2; loc_u_Ptr += nx2;
}

So here, in the parallel version, I'm trying to provide each k-iteration with independent slider copies (names starting with loc_) and the corresponding offsets.

Now, about timing. When enclosing the cycle above in clock() calls, the result varies from 0.0149 to 0.016 sec, the same for the serial and parallel versions. If I change the clock() calls to omp_get_wtime(), the result varies from 0.0156 to 0.018 sec for serial and from 0.0110 to 0.0118 sec for parallel. These timings differ a little from test to test; anyway, by omp_get_wtime() the parallel version seems to be about 30% faster than the serial one.

maemarcus
Beginner
I've heard that with OpenMP it is worse to parallelize code that works through pointers (as mine does) rather than through arrays with the [] indexing operator. Is that true?
TimP
Honored Contributor III
Pointers wouldn't necessarily be a problem if you made them and the loop indices private.
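A minimal sketch of that idea; the array names u, U and the sizes n, np are hypothetical stand-ins, not the original data structures:

// Pointers and indices declared inside the parallel loop body are
// automatically private to each thread, so no two threads share a slider.
void stencil_pass(const double *u, double *U, int n, int np, double c)
{
    #pragma omp parallel for
    for (int k = 0; k < n; k++)
    {
        const double *src = u + k * np;   // thread-private slider
        double       *dst = U + k * np;   // thread-private slider
        for (int i = 0; i + 1 < np; i++)
            dst[i] = c * (src[i] + src[i + 1]);
    }
}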
maemarcus
Beginner
OK, thanks!