quince
Beginner
151 Views

OpenMP in multiple user threads and pooling

Suppose I have multiple user threads, created with boost::threads for example. What performance considerations would there be if more than one of these threads has OpenMP code in it? Will OpenMP thread pools created by separate user threads be shared? What I'd like to do is have separate render, animation, simulation, input, networking, etc. threads, where the animation and simulation threads would each benefit from OpenMP. I don't want to use omp sections for these because it's not very suitable at this higher level of granularity; I'd rather reserve it for the data-parallel work. But how do I avoid potential performance penalties, oversubscription of threads to cores, etc.?

0 Kudos
8 Replies
jimdempseyatthecove
Black Belt
151 Views

Quoting - quince

Suppose I have multiple user threads, created with boost::threads for example. What performance considerations would there be if more than one of these threads has OpenMP code in it? Will OpenMP thread pools created by separate user threads be shared? What I'd like to do is have separate render, animation, simulation, input, networking, etc. threads, where the animation and simulation threads would each benefit from OpenMP. I don't want to use omp sections for these because it's not very suitable at this higher level of granularity; I'd rather reserve it for the data-parallel work. But how do I avoid potential performance penalties, oversubscription of threads to cores, etc.?

You are right to be concerned about oversubscription of threads when mixing threading models. I would suggest you look carefully at your code to decide whether your interest is in a quick-to-implement route or in maximizing performance. It is not clear from your post whether your current code spawns (boost::threads) on each iteration or whether your threads are spawned once and use event flags as opposed to polling. Preferably you spawn once and use event-driven threads.

Assuming you want a quick-to-implement method (to maximize performance you might use something like TBB), here is a quick and easy approach (pseudo code):

main
{
    OMP enable nested parallelization
    OMP set number of threads 2
    OMP Parallel Sections
        OMP Section
            BoostThreads(AllButAnimationAndSimulation)
            Animation();
        OMP Section
            Simulation();
    OMP End Parallel Sections
}

Animation
{
    while(WaitForMessage)
    {
        if(Done || Abort) return;
        OMP Set Number Of Threads (you choose)
        OMP Parallel Do
            Your Animation loop
        OMP End Parallel Do
    }
}

Simulation
{
    while(WaitForMessage)
    {
        if(Done || Abort) return;
        OMP Set Number Of Threads (you choose)
        OMP Parallel Do
            Your Simulation loop
        OMP End Parallel Do
    }
}

There are other alternatives such as TBB and some I cannot mention on this forum.

The trick here is to balance the system such that the Render phase is not starved for CPU cycles. If you are running on 4 or more cores, the OpenMP threads could use 3 cores and be restricted, using affinity, to running on 3 of the 4 cores. This would leave one core for the other threads. On a 2-core system the tactic might be: do not use affinity, do not limit the number of cores in Animation and Simulation (run with 2), but in the Animation and Simulation loops insert a call to SwitchToThread(), which is a low-overhead "run something else if something else is available to run" (much less overhead than Sleep(0)). Note, you can also make the call to SwitchToThread() on every n'th iteration in these loops. This technique can also be used on the 4-core system if more compute time is required by the Animation and Simulation loops (and less by the other threads).
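The yield-every-n'th-iteration idea can be sketched as below. SwitchToThread() is Win32-only, so on other platforms sched_yield() is substituted here (my assumption; the post only names the Win32 call):

```c
#ifdef _WIN32
#include <windows.h>
/* Low-overhead "run something else if something is ready to run". */
static void yield_cpu(void) { SwitchToThread(); }
#else
#include <sched.h>
/* POSIX stand-in for SwitchToThread(). */
static void yield_cpu(void) { sched_yield(); }
#endif

enum { YIELD_EVERY = 64 };  /* yield on every n'th iteration; tune n */

/* Stand-in compute loop for the Animation/Simulation body. */
long work_loop(int iterations) {
    long acc = 0;
    for (int i = 0; i < iterations; ++i) {
        acc += i;                              /* real work goes here */
        if (i % YIELD_EVERY == YIELD_EVERY - 1)
            yield_cpu();                       /* let render/input threads run */
    }
    return acc;
}
```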

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
151 Views

Quoting - quince

Suppose I have multiple user threads created with boost::threads for example. What performance considerations would there be if more than one of these threads has OpenMP code inserted in it? Will OpenMP thread pools created by separate user threads be shared?

I believe that a sane OpenMP implementation will not create oversubscription of threads to processors, i.e. there will be only one set of worker threads no matter how many user threads use OpenMP. Although I don't know for sure. You can easily check it - run the program and check the number of threads in Task Manager (Windows) or in 'ps' (*nix).

quince
Beginner
151 Views

Thanks for the replies. One thing I was thinking of is using a lock-free queue (already implemented) to store the work, and another to pass it to the renderer.

For example:

AtomicQueue in, out;
bool exit;

#pragma omp sections
{
    #pragma omp section
    {
        SDL_CreateThread(LogicLoop, in);
        SDL_CreateThread(SceneGraph::ComputeOrder, in);
        while (!exit) DrawFrame(out); // Blocks on VSync, or until all drawable objects for this frame are collected from out
    }
    #pragma omp section
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < num_threads; ++i)
        {
            if (in.empty()) continue;
            Task task;
            in.pop(task);
            if (!task.dataBufferFull()) Yield(); // Not sure of the best way to do this here
            task.Doit();
            out.push(task);
        }
    }
}

With double-buffering (or more) of the task data, Draw can copy out of completed buffers it collects from the queue regardless of the order of completion (OpenGL commands have to be serialized in the same thread). It's also pretty easy to precompute certain items ahead for multiple frames, by extending the buffers (say, for items not influenced by user input).
TimP
Black Belt
151 Views

The primary way to use OpenMP inside another parallel framework has been with MPI. In this case the OpenMP tasks are separate, and each must have its threads affinitized to the cores associated with its own cache, e.g. one OpenMP task per socket. OpenMP can keep track of the total number of OpenMP threads on a shared-memory node, but can't account for threads outside OpenMP. Without a supported scheme to pass the preferred affinities to each OpenMP task, such as Intel MPI 3.2 has, it's difficult to keep multiple OpenMP tasks on a single shared-memory platform from interfering with one another.
In your case, you probably want to see whether one instance of OpenMP is capable of running multiple parallel regions, where you set a small num_threads for each parallel region and set a nowait on each parallel section so another can start (using the remaining threads from the same pool), then explicitly wait for each region to complete where you need to. I haven't seen an example.
quince
Beginner
151 Views

I assume that OpenMP keeps a thread pool, and that #pragma omp task work is assigned to those threads dynamically. However, if I understand correctly, the problem is that although one can specify thread affinity with Intel's compiler, one can't specify task affinity. So a couple of questions: To what extent is this a problem among cores of a single CPU (as opposed to multiple CPUs, which is not a concern for me at the moment)? Is task affinity support planned for the future, as some other platforms have? And is it possible to check, within an omp task, which core the thread that happens to be running the task is on? The last question is because in that case I could have multiple task queues, and for a thread on a given core I would make it prefer a given task queue. Thanks.
pvonkaenel
New Contributor III
151 Views

Quoting - quince
I assume that OpenMP keeps a thread pool, and that #pragma omp task work is assigned to those threads dynamically. However, if I understand correctly, the problem is that although one can specify thread affinity with Intel's compiler, one can't specify task affinity. So a couple of questions: To what extent is this a problem among cores of a single CPU (as opposed to multiple CPUs, which is not a concern for me at the moment)? Is task affinity support planned for the future, as some other platforms have? And is it possible to check, within an omp task, which core the thread that happens to be running the task is on? The last question is because in that case I could have multiple task queues, and for a thread on a given core I would make it prefer a given task queue. Thanks.

I think I've seen sample code that outlined how to assign OpenMP threads to specific cores. I think you need to check the thread number in the parallel section using one of the omp_ functions declared in omp.h, and then set the affinity using whatever OS routine is available (SetThreadAffinityMask() in Win32).

jimdempseyatthecove
Black Belt
151 Views

Quoting - pvonkaenel
Quoting - quince
I assume that OpenMP keeps a thread pool, and that #pragma omp task work is assigned to those threads dynamically. However, if I understand correctly, the problem is that although one can specify thread affinity with Intel's compiler, one can't specify task affinity. So a couple of questions: To what extent is this a problem among cores of a single CPU (as opposed to multiple CPUs, which is not a concern for me at the moment)? Is task affinity support planned for the future, as some other platforms have? And is it possible to check, within an omp task, which core the thread that happens to be running the task is on? The last question is because in that case I could have multiple task queues, and for a thread on a given core I would make it prefer a given task queue. Thanks.

I think I've seen sample code that outlined how to assign OpenMP threads to specific cores. I think you need to check the thread number in the parallel section using one of the omp_ functions declared in omp.h, and then set the affinity using whatever OS routine is available (SetThreadAffinityMask() in Win32).


*** Caution ***

The OpenMP function omp_get_thread_num() does NOT return a thread number.
It returns the thread team member number.
Only when not using nested parallel regions can you assume that omp_get_thread_num() returns a thread number.
When using nested parallel regions, at each nest level omp_get_thread_num() returns team member numbers of 0:n-1, where n is the number of threads in the team at that particular level.

A solution to this is to create a thread-private variable compiled with initialization to -1. When you want a thread number, test this value for -1; when it is -1, use an _InterlockedIncrement on a separate variable, also initialized to -1. The result can be used as a process-wide unique thread number. After you have that number, store it into the thread-private variable that you use as the thread number. (You can also use the thread handle on Windows, but the thread-private variable will have less overhead once initialized.)
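That scheme can be sketched in portable C11, with atomic_fetch_add standing in for the Win32 _InterlockedIncrement and _Thread_local for the thread-private variable (my translation of the Win32 description above):

```c
#include <stdatomic.h>

static atomic_int next_id = -1;        /* shared counter, starts at -1 */
static _Thread_local int my_id = -1;   /* thread-private, initialized to -1 */

/* Process-wide unique thread number, assigned lazily on first call. */
int unique_thread_num(void) {
    if (my_id == -1)                                 /* not yet assigned? */
        my_id = atomic_fetch_add(&next_id, 1) + 1;   /* first caller gets 0 */
    return my_id;
}
```

Each thread pays for the atomic increment exactly once; every later call reads only the thread-private variable, which is the low-overhead property described above.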

Jim Dempsey
pvonkaenel
New Contributor III
151 Views


Quoting - jimdempseyatthecove

*** Caution ***

The OpenMP function omp_get_thread_num() does NOT return a thread number.
It returns the thread team member number.
Only when not using nested parallel regions can you assume that omp_get_thread_num() returns a thread number.
When using nested parallel regions, at each nest level omp_get_thread_num() returns team member numbers of 0:n-1, where n is the number of threads in the team at that particular level.

A solution to this is to create a thread-private variable compiled with initialization to -1. When you want a thread number, test this value for -1; when it is -1, use an _InterlockedIncrement on a separate variable, also initialized to -1. The result can be used as a process-wide unique thread number. After you have that number, store it into the thread-private variable that you use as the thread number. (You can also use the thread handle on Windows, but the thread-private variable will have less overhead once initialized.)

Jim Dempsey

Hi Jim,

Thanks for the tip. It looks like without it you could end up with a fairly tricky-to-debug affinity problem.

Peter