
Hello, I've got several huge loops structured as follows:

for (unsigned int k = 1; k < ns_1; k++)
{
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *_C(UPtr) = quat_dtDivdx * (
                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr)); UPtr++;
            uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
}

Here _C() and _R() are macros implementing the numerical stencil pattern, i.e. the central point and the right point. The Ptrs are sliders that move over one-dimensional arrays. So, a fairly typical loop for a computational algorithm.

Say, I'd like to add OpenMP support here. I do the following:

#ifdef _OPENMP
#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    // Thread-localize data sliders.
    double
        *loc_UPtr = UPtr + k * nx_1 * ny_2,
        *loc_uPtr = uPtr + k * np,
        *loc_u_Ptr = u_Ptr + k * np;
    // Redefine data sliders.
    #define UPtr loc_UPtr
    #define uPtr loc_uPtr
    #define u_Ptr loc_u_Ptr
#else
for (unsigned int k = 1; k < ns_1; k++)
{
#endif
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *_C(UPtr) = quat_dtDivdx * (
                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr)); UPtr++;
            uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
#ifdef _OPENMP
    // Undefine data sliders.
    #undef UPtr
    #undef uPtr
    #undef u_Ptr
#endif
}

This simple idea came after looking at basic OpenMP examples:

1) set the pragma on the outer loop;

2) for every slider, create an independent thread-local copy using preprocessor definitions.

OK, now please let me ask the question: why does the threading extension described above bring absolutely NO benefit on a dual-core machine? I mean, there is no speedup; the timings (I use the clock() function) are almost equal. However, in the task manager I can see that with _OPENMP both cores are kept busy by the program's process. What is the reason?

Thanks.



From your description, I'm not certain whether you have a dependency problem, where one thread uses data which are updated by another thread.


Hello, Tim,

Thanks for the reply.

> From your description, I'm not certain whether you have a dependency problem, where one thread uses data which are updated by another thread.

I suppose my threads are not data-dependent. The general formula is U = F(u, u_): here u and u_ are read-only, and U is not self-dependent. To make it clear, let me provide the preprocessed source:

double
    *UPtr = this->get_TopLevel()->get_Values(),
    *u_Ptr = uFlow->levels[uFlow->levelsCount - 1]->get_Values() +
        nx + np,
    *uPtr = uFlow->get_TopLevel()->get_Values() + nx + np;

#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    double
        *loc_UPtr = UPtr + k * nx_1 * ny_2,
        *loc_uPtr = uPtr + k * np,
        *loc_u_Ptr = u_Ptr + k * np;
    for (unsigned int j = 1; j < ny_1; j++)
    {
        for (unsigned int i = 0; i < nx_1; i++)
        {
            *((loc_UPtr)) = quat_dtDivdx * (
                *((loc_u_Ptr)) + *((loc_u_Ptr + 1)) + *((loc_uPtr)) + *((loc_uPtr + 1)));
            if (abs(*((loc_UPtr))) > 1e0) throw *((loc_UPtr)); loc_UPtr++;
            loc_uPtr++; loc_u_Ptr++;
        }
        loc_uPtr++; loc_u_Ptr++;
    }
    loc_uPtr += nx2; loc_u_Ptr += nx2;
}
}

So here, in the parallel version, I'm trying to provide each k-iteration with independent slider copies (names starting with loc_) and the corresponding offsets.

Now, about timing. When enclosing the cycle above in clock() calls, the result varies from 0.0149 to 0.016 sec, the same for the serial and parallel versions. If I change clock() to omp_get_wtime(), the result varies from 0.0156 to 0.018 sec for serial and from 0.0110 to 0.0118 sec for parallel. These timings differ a little from test to test; anyway, according to omp_get_wtime(), the parallel version seems to be about 30% faster than the serial one.
