Re: OpenMP-compatible threading of non-inline functions

marcusexoticmatter_c · 08-20-2009

Hello,
I recently purchased the Intel C++ Compiler 11.1, having previously used gcc, so I am new to this Intel forum. I apologize in advance if I've put this question in the wrong forum... :)

I am developing a Linux/OS X-based application with a programming API that my customers can use to write their own plug-ins. One of the main features of this API is function callbacks: the user writes his/her own function, and the API then calls that function through a callback mechanism. Another big feature of my application is that it is threaded using OpenMP. This is fine, except for the restriction that functions threaded by OpenMP must, essentially, have no side effects outside the scope of the thread, which means each one has to be an inline function calling only other inline functions, with only const parameters (or thread-private variables). That is fine. However, for the callback mechanism there is no way to do this, because the function to be called is unknown at the time my API is compiled (it's just a function pointer).
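To make the callback pattern concrete, here is a minimal sketch of an API entry point that calls a user-supplied function pointer from inside an OpenMP parallel loop. The names `plugin_callback`, `api_apply_callback` and `scale_by` are hypothetical, not part of any real API. The callee is only known through a pointer, so it cannot be inlined, yet the loop is still valid OpenMP as long as the callback itself is thread-safe.

```c
/* Hypothetical plug-in callback type: the API only sees a function
   pointer, never the function body, so the call cannot be inlined. */
typedef double (*plugin_callback)(double x, void *user_data);

/* Hypothetical API entry point: applies the user's callback to every
   element in parallel. Each iteration writes only out[i], so the loop
   itself introduces no data race; the callback must still avoid
   unprotected static or global state. Without OpenMP the pragma is
   ignored and the loop simply runs serially. */
void api_apply_callback(plugin_callback cb, void *user_data,
                        const double *in, double *out, long n)
{
    long i;  /* the loop variable of an omp for loop is private */
    #pragma omp parallel for
    for (i = 0; i < n; ++i)
        out[i] = cb(in[i], user_data);
}

/* Example user plug-in: reads user_data but never writes shared state. */
double scale_by(double x, void *user_data)
{
    const double *factor = (const double *)user_data;
    return x * *factor;
}
```

Passing `user_data` as a read-only pointer is one way to give the callback configuration without introducing shared mutable state.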

For the other 99% of threading cases, OpenMP is fine, since that code is all threaded internally and I can provide inline implementations. So I could use pthreads or something similar for this one instance of the callback. However, I've heard that mixing OpenMP and pthreads (or other threading systems) is unsafe, so I am asking for advice on how to do this. Is there no way to use the OpenMP API to do pthreads-like "manual" forks and joins, where I can pass in any C function, inline or not?
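For reference, the closest OpenMP equivalent of a pthreads-style fork and join is the task construct, sketched below under the assumption of a compiler with OpenMP 3.0 task support. Each `#pragma omp task` plays the role of a fork that runs an arbitrary function pointer, and `#pragma omp taskwait` plays the role of the join. The `job_fn` type and the `run_jobs` and `square_in_place` functions are hypothetical illustrations.

```c
/* Hypothetical generic job type: any non-inline C function plus its
   argument, the same shape as a pthread_create() start routine. */
typedef void (*job_fn)(void *arg);

/* Fork one OpenMP task per job, then join with taskwait. Without
   OpenMP the pragmas are ignored and the jobs simply run in order. */
void run_jobs(job_fn *fns, void **args, int njobs)
{
    #pragma omp parallel
    {
        /* One thread creates the tasks; the whole team executes them. */
        #pragma omp single
        {
            int j;
            for (j = 0; j < njobs; ++j) {
                #pragma omp task firstprivate(j)
                fns[j](args[j]);        /* "fork": call any function pointer */
            }
            #pragma omp taskwait        /* "join": wait for the tasks above */
        }
    }
}

/* Example job: squares an int in place; each job touches only its own
   argument, so the jobs do not race with each other. */
void square_in_place(void *arg)
{
    int *v = (int *)arg;
    *v = *v * *v;
}
```

Usage: build arrays of function pointers and arguments, e.g. `job_fn fns[2] = {square_in_place, square_in_place};`, and call `run_jobs(fns, args, 2);`.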

Please advise.
Thank you,
Marcus Nordenstam

jimdempseyatthecove · 08-20-2009

Marcus,

There is nothing inherent about non-inline functions that makes them unsuitable for use with OpenMP.

How you program the function may have an adverse effect.

What you need to do as a programmer is make sure the function is thread-safe.

Usually this means no static or global variables without proper protection. For example, suppose your function has a static or global counter that is incremented each time it is entered.

You cannot safely perform a plain count++. You must make the increment thread-safe by using an atomic increment:

_InterlockedIncrement(&count);

or

#pragma omp atomic
count++;

or other means.
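A minimal sketch of the counter case described above, using the portable `#pragma omp atomic` form. The names `call_count`, `counted_callback` and `count_calls` are hypothetical. Without the atomic directive, concurrent `call_count++` operations could lose updates, and the returned total could come out smaller than `n`.

```c
/* Hypothetical global call counter shared by all threads. */
static long call_count = 0;

/* Hypothetical callback that counts how often it is entered. A plain
   call_count++ is a read-modify-write race when several threads call
   this at once; "omp atomic" makes each increment indivisible. */
void counted_callback(void)
{
    #pragma omp atomic
    call_count++;
    /* ... the callback's real, thread-safe work would go here ... */
}

/* Calls the callback n times from a parallel loop and returns the
   number of calls recorded by the counter. */
long count_calls(long n)
{
    long i;
    call_count = 0;
    #pragma omp parallel for
    for (i = 0; i < n; ++i)
        counted_callback();
    return call_count;  /* the implicit barrier ends all increments */
}
```

Atomic is usually cheaper than a critical section for a single scalar update like this; a critical section is the more general option when several variables must change together.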

Jim Dempsey

TimP · 08-20-2009

Most Linux or Mac OS implementations of OpenMP, including Intel's, are based on pthreads, while most successful Windows OpenMP implementations are based on Windows threads. So your prospects for mixing pthreads with OpenMP are better on Linux and Mac OS than on Windows. Thus, a mixture of OpenMP and pthreads is less portable than either by itself.
The default OpenMP library of icc 11.1 supports the libgomp calls made by gcc/g++-compiled code, so you can mix gomp with icc/icpc.
I don't understand your question about manual forks and joins. You certainly can put an OpenMP parallel region inside a function, although it's more efficient if each parallel region contains enough work to outweigh the cost of starting it.
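A minimal sketch of a function with its own internal parallel region. The `if` clause is one standard OpenMP way to skip the parallel start-up cost when there is too little work; the threshold `PARALLEL_MIN_N` and its value are hypothetical and would need tuning per machine, and `sum_array` is an illustrative name.

```c
/* Hypothetical threshold below which starting a parallel region is
   likely to cost more than it saves; the right value is machine-dependent. */
#define PARALLEL_MIN_N 10000L

/* A library function with its own internal parallel region. The if()
   clause runs small inputs on a single thread, so any parallel region
   that does start has a worthwhile amount of work. */
double sum_array(const double *a, long n)
{
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum) if(n >= PARALLEL_MIN_N)
    for (i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

The caller does not need to know whether the function runs in parallel; the decision stays inside the function.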

marcusexoticmatter_c · 08-21-2009

Thanks Jim!
From what I had read online, it seemed that I could only call inline functions inside an OpenMP parallel region. This news makes my life a lot easier! I will simply use OpenMP for everything, then. Of course, the code in the parallel region (including any called functions) will be thread-safe :) Cheers.
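One common way to keep a called function thread-safe when it relies on a static scratch buffer is OpenMP's `threadprivate` directive, which gives each OpenMP thread its own copy of the static. This is a minimal sketch; `scratch`, `label_length` and `total_label_length` are hypothetical names, and `threadprivate` only covers threads created by OpenMP itself.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical static scratch buffer used by a callback. Shared as-is,
   concurrent calls would overwrite each other's contents;
   threadprivate gives each OpenMP thread a separate copy. */
static char scratch[64];
#pragma omp threadprivate(scratch)

/* Hypothetical callback: formats a label into the scratch buffer and
   returns its length. */
size_t label_length(int id)
{
    snprintf(scratch, sizeof scratch, "item-%d", id);
    return strlen(scratch);
}

/* Sums the label lengths for ids 0..n-1 in parallel. */
size_t total_label_length(int n)
{
    long total = 0;
    int i;
    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; ++i)
        total += (long)label_length(i);
    return (size_t)total;
}
```

For ids 0 to 9 every label ("item-0" to "item-9") has 6 characters, so `total_label_length(10)` is 60.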

While I've got your attention, I have one more question, regarding Intel Cluster OpenMP. One of the main reasons for using MPI is that you can run a process distributed across several nodes, effectively addressing a heap larger than the memory physically (or virtually) available on any single node, by partitioning memory usage across the nodes and using message passing to communicate among them.

From what I've read about Cluster OpenMP, on the other hand, the heap is "shared" across the nodes in the cluster (even though each node holds its own copy and uses memory page protection to synchronize the memory between nodes). If I understand this correctly, each node still has to hold the entire heap in memory, making it impossible to run distributed processes that, taken together, use a heap of, say, 10x the node's physical memory. If I have to partition the heap manually (as I would with MPI), then what is the advantage of using Cluster OpenMP over MPI?

Thanks for your advice,
marcus

TimP · 08-21-2009

The advantage of Cluster OpenMP over MPI is that it shares much of OpenMP's syntax. It never claimed to have the performance potential of MPI.

Michael_K_Intel2 · 08-23-2009

Marcus,

Your understanding is not entirely correct. With Cluster OpenMP, a node's memory is divided into three areas: non-shareable local memory, cache memory for remote data, and shared memory. The remote cache temporarily stores pages that have been transferred from the shared memory of other nodes for local access. This is what the literature refers to as "distributed shared memory". If a thread running on node A accesses data that is stored on node B, it requests the data from B and places it in A's local cache. Subsequent accesses are then local until the data is flushed back to the node that allocated it.

Hence, the total amount of memory your application can use is roughly n times the size of each node's shared memory area, where n is the number of nodes running the application.
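Applied to the 10x-heap scenario from the question, the estimate above gives a rough node count. Here P (a node's physical memory) and f (the fraction of it assumed to be devoted to the shared area) are illustrative symbols, not Cluster OpenMP parameters, and the estimate ignores the local and remote-cache areas beyond what f already excludes:

```latex
M_{\text{total}} \approx n \, M_{\text{shared}} = n f P,
\qquad
n f P \ge 10P \;\Longrightarrow\; n \ge \frac{10}{f},
\qquad
f = \tfrac{1}{2} \;\Longrightarrow\; n \ge 20 .
```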

Cheers,
-michael

marcusexoticmatter_c · 08-23-2009

Thanks Michael,
That does make perfect sense.
/marcus