Beginner

## OpenMP no speedup

Hello, I've got several huge loops fashioned as follows:

```cpp
for (unsigned int k = 1; k < ns_1; k++) {
    for (unsigned int j = 1; j < ny_1; j++) {
        for (unsigned int i = 0; i < nx_1; i++) {
            *_C(UPtr) = quat_dtDivdx * (*_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr));
            UPtr++; uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
}
```

Here _C( ) and _R( ) are macros related to the numerical stencil, i.e. the central point and the right point. The Ptrs are sliders that move over one-dimensional arrays. So, a most common loop for a computational algorithm.

Say, I'd like to add OpenMP support here. I do the following:

```cpp
#ifdef _OPENMP
#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    // Thread-localize data sliders.
    double *loc_UPtr  = UPtr  + k * nx_1 * ny_2,
           *loc_uPtr  = uPtr  + k * np,
           *loc_u_Ptr = u_Ptr + k * np;
    // Redefine data sliders.
    #define UPtr  loc_UPtr
    #define uPtr  loc_uPtr
    #define u_Ptr loc_u_Ptr
#else
for (unsigned int k = 1; k < ns_1; k++)
{
#endif
    for (unsigned int j = 1; j < ny_1; j++) {
        for (unsigned int i = 0; i < nx_1; i++) {
            *_C(UPtr) = quat_dtDivdx * (*_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));
            THROW_COURANT(*_C(UPtr));
            UPtr++; uPtr++; u_Ptr++;
        }
        uPtr++; u_Ptr++;
    }
    uPtr += nx2; u_Ptr += nx2;
#ifdef _OPENMP
    // Restore data sliders.
    #undef UPtr
    #undef uPtr
    #undef u_Ptr
#endif
}
```

This simple idea came after looking at basic OpenMP examples:

1) set the pragma on the outer loop;

2) for every slider, create an independent thread-local copy using preprocessor definitions.

OK, now please let me ask the question: why does the threading extension described above bring absolutely NO benefit on a dual-core machine? I mean, there is no speedup; the timings (I use the clock() function) are almost equal. However, in the task manager I can see that with _OPENMP both cores get busy by the program's process. What is the reason?

Thanks.

5 Replies
Black Belt
clock() measures the total CPU time used by all threads, so you would have an excellent result if that time doesn't increase with threading. OpenMP provides the function omp_get_wtime() for measuring elapsed (wall-clock) time; on Intel-compatible platforms, __rdtsc() may also be useful.
From your description, I'm not certain whether you have possible dependency problems, where one thread uses data which are updated by the other.
Beginner

Hello, Tim,

> From your description, I'm not certain if you have possible dependency problems, where one thread uses data which are updated by the other.

I suppose my threads are not data-dependent. The general formula is U = F(u, u_) - here u and u_ are read-only, and U is not self-dependent. To make it clear, let me provide the preprocessed source:

```cpp
double *UPtr  = this->get_TopLevel()->get_Values(),
       *u_Ptr = uFlow->levels[uFlow->levelsCount - 1]->get_Values() + nx + np,
       *uPtr  = uFlow->get_TopLevel()->get_Values() + nx + np;

#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)
for (int k = 0; k < ns_2; k++)
{
    double *loc_UPtr  = UPtr  + k * nx_1 * ny_2,
           *loc_uPtr  = uPtr  + k * np,
           *loc_u_Ptr = u_Ptr + k * np;

    for (unsigned int j = 1; j < ny_1; j++) {
        for (unsigned int i = 0; i < nx_1; i++) {
            *loc_UPtr = quat_dtDivdx * (*loc_u_Ptr + *(loc_u_Ptr + 1) + *loc_uPtr + *(loc_uPtr + 1));
            if (abs(*loc_UPtr) > 1e0) throw *loc_UPtr;
            loc_UPtr++; loc_uPtr++; loc_u_Ptr++;
        }
        loc_uPtr++; loc_u_Ptr++;
    }
    loc_uPtr += nx2; loc_u_Ptr += nx2;
}
```

So here, in the parallel version, I'm trying to provide each k-iteration with independent slider copies (names starting with loc_) and the corresponding offsets.

Now, about timing. When enclosing the cycle above in clock() calls, the result varies from 0.0149 to 0.016 sec, the same for the serial and parallel versions. If I change the clock() calls to omp_get_wtime(), the result varies from 0.0156 to 0.018 sec for serial and from 0.0110 to 0.118 sec for parallel. These timings differ a little from test to test; anyway, by omp_get_wtime() the parallel version seems to be about 30% faster than the serial one.

Beginner
I've heard that with OpenMP it is worse to parallelize code containing pointers (like I do) instead of arrays with indexers []. Is that true?
Black Belt
Pointers wouldn't necessarily be a problem if you made them and the loop indices private.
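(A minimal sketch of that advice: pointer sliders are safe under OpenMP when each iteration derives its own pointers from the loop index, because variables declared inside the loop body are automatically private to the executing thread. The function and names below are illustrative, not from the code above.)

```c
#include <stddef.h>

/* Scale each row of a rows x cols matrix: dst[k][i] = f * src[k][i].
   The per-row pointers d and s are declared inside the parallel loop
   body, so each thread gets its own private copies - no shared slider
   is incremented across iterations. */
void scale_rows(double *dst, const double *src, int rows, int cols, double f)
{
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int k = 0; k < rows; k++) {
        double       *d = dst + (size_t)k * cols; /* private per thread */
        const double *s = src + (size_t)k * cols; /* private per thread */
        for (int i = 0; i < cols; i++)
            d[i] = f * s[i];
    }
}
```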
Beginner
OK, thanks!