Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Emulating an OpenMP parallel for with TBB

ipapadop
Beginner
1,496 Views
I was wondering if it is possible to have a construct similar to an OpenMP parallel for with TBB.
For example, in OpenMP I can do the following:
#include <omp.h>

void foo(void) {
  #pragma omp parallel for
  for (int i = 0; i < 10000; ++i) {
    int tid = omp_get_thread_num();  // current thread id
    int nth = omp_get_num_threads(); // current number of threads
    // do something - even if there are dependencies
  }
}
I am aware of the existence of the parallel_* algorithms, but they make no guarantee that the tasks generated will execute in parallel (and it is explicitly mentioned that dependencies between them can cause a deadlock).
The question is whether I can express a fork-join model in TBB and have the guarantee that it forks as widely as the resources allow.
I believe it should be possible if I had access to information on how many threads are already executing something.
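For reference, a minimal sketch of the closest query I know of in classic TBB: task_scheduler_init::default_num_threads() reports the default pool size, but as far as I know there is no counterpart to omp_get_num_threads() for threads currently busy.

#include <cstdio>
#include "tbb/task_scheduler_init.h"

int main() {
    // Number of worker threads TBB would create by default
    // (usually the hardware thread count). There is no TBB
    // query for how many of them are currently executing work.
    int n = tbb::task_scheduler_init::default_num_threads();
    std::printf("default pool size: %d\n", n);
    return 0;
}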
0 Kudos
32 Replies
jimdempseyatthecove
Honored Contributor III
339 Views
A runtime system cannot predict when a busy thread will become idle. Your code, though, could have heuristics or pre-determined tuning parameters (e.g., in OpenMP, setting the number of threads for the next parallel region).

As for partitioning based upon currently idle threads: while TBB doesn't have it, QuickThread (www.quickthreadprogramming.com) does have a means to partition based upon availability of threads:

parallel_for( Waiting$, aFunctor, iBegin, iEnd, otherArgsHere);

If you want to restrict scheduling to the current thread plus idle threads, but only to threads in the same processor socket as the thread issuing the parallel_... call:

parallel_for( Waiting$+L3$, aFunctor, iBegin, iEnd, otherArgsHere);

There are several other scheduling hints you can use as well. E.g., one thread per socket:

parallel_for( OneEach$+L3$, aFunctor, iBegin, iEnd, otherArgsHere);

Or, if you know the thread will block on I/O or an event:

parallel_task( IO$, aFunctor, otherArgsHere);

QuickThread has two classes of threads: computational and I/O.
In its parallel_pipeline you can mix threads of different classes along different stages of the pipeline. So your original problem, where your f(args) has sections with computation only and other sections with I/O or event waiting, could easily be handled with this type of parallel_pipeline.

Back to TBB for a moment.

You can use an atomic class variable to count the number of busy (or idle) threads in your application. This will take some work on your part, but could possibly be done by inserting a class in front of each task (kind of like profiling). Based on the availability of threads you could use TBB's parallel_invoke:

switch (nAvailableThreads)
{
case 0: // count is buggered - play it safe
case 1:
    f(args);
    break;
case 2:
    argsSplit(args1, args2);
    parallel_invoke(
        [&](){ f(args1); },
        [&](){ f(args2); });
    break;
case 3:
    ...
default:
    // split max way here
} // switch


The above is your "traditional" fork and join.
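A minimal TBB sketch of the counting idea (BusyGuard and forkJoin are illustrative names, not TBB API, and the count is only a heuristic, since threads can change state between the read and the fork):

#include "tbb/atomic.h"
#include "tbb/parallel_invoke.h"

tbb::atomic<int> busyCount; // zero-initialized at namespace scope

struct BusyGuard { // bumps the count for the lifetime of a task body
    BusyGuard()  { ++busyCount; }
    ~BusyGuard() { --busyCount; }
};

void f(int begin, int end) {
    BusyGuard guard;                    // this task is now "busy"
    for (int i = begin; i < end; ++i) {
        // ... work on element i ...
    }
}

void forkJoin(int begin, int end, int nThreads) {
    int nAvailable = nThreads - busyCount; // heuristic, may be stale
    if (nAvailable >= 2) {
        int half = begin + (end - begin) / 2;
        tbb::parallel_invoke(
            [&](){ f(begin, half); },
            [&](){ f(half, end); });
    } else {
        f(begin, end); // no idle thread believed available: run serially
    }
}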

Jim Dempsey
0 Kudos
ipapadop
Beginner
339 Views
"You can use an atomic class variable to count the number of busy (or idel) threads in your application. This will take some work on your part but could possibly be done by inserting a class in the front of each task (kind of like profiling). Based on the availability of threads you could use TBB's parallel_invoke"
This is actually my current approach (taken from the structured_task_group section of the TBB reference manual) - however, this will break if TBB is also used somewhere in user code that I'm unaware of.

parallel_invoke() will try to instantiate n copies of my function, but only some of them may execute in parallel, depending on resources. My function is SPMD, so it may contain collective operations that will cause a deadlock if not all instances are running (sketched below).

Rewriting everything using continuations, although it is highly appealing and solves a lot of problems, is not possible.
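To illustrate the hazard with a sketch (not my actual code; the spin barrier stands in for any collective operation):

#include "tbb/atomic.h"
#include "tbb/task_group.h"

tbb::atomic<int> arrived; // starts at zero

void spmdBody(int n) {
    ++arrived;
    while (arrived < n) {} // collective barrier: spin until all n arrive
    // ... rest of the SPMD phase ...
}

void run(int n) {
    tbb::task_group g;
    for (int i = 0; i < n; ++i)
        g.run([=]{ spmdBody(n); });
    g.wait(); // DEADLOCKS if fewer than n workers run the tasks:
              // started instances spin, queued ones never start
}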
0 Kudos
Alexey-Kukanov
Employee
339 Views
Quoting ipapadop
@Robert Reed: I am basically asking: if I call spawn_root_and_wait() from a task T1, what will the scheduling policy be? Will TBB execute the spawned tasks from T1 and return me to T1 ASAP, or will it execute other pending tasks as well? In the latter case, will it work-steal as well?

The thread will return to T1 ASAP, but it's allowed to take other pending tasks and even steal if return is not yet possible (which means part of the spawned job was stolen and not yet completed by the thieves).
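For reference, a sketch of the situation in question using the classic tbb::task interface (RootTask plays the role of T1; ChildTask is illustrative):

#include "tbb/task.h"

struct ChildTask : public tbb::task {
    tbb::task* execute() {
        // ... spawned work ...
        return NULL;
    }
};

struct RootTask : public tbb::task { // plays the role of T1
    tbb::task* execute() {
        set_ref_count(3); // two children + one for the wait itself
        spawn(*new(allocate_child()) ChildTask);
        // While blocked here, this thread returns to T1 as soon as
        // it can, but may run other pending tasks (or steal) if part
        // of the spawned work was stolen and is not yet finished.
        spawn_and_wait_for_all(*new(allocate_child()) ChildTask);
        return NULL;
    }
};

void runT1() {
    tbb::task::spawn_root_and_wait(
        *new(tbb::task::allocate_root()) RootTask);
}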

0 Kudos
RafSchietekat
Valued Contributor III
339 Views
""Wrong. TBB deals more efficiently with the opportunity cost of parallel execution."
That's completely orthogonal."
How does the program adapt to incorrect estimations of the workloads at launch time and/or different progress at run time (otherwise they don't finish together and some processing units sit idle)? How does it adapt to new processing elements becoming available while the work is in progress, or conversely to new high-level parallel slack elsewhere in the program? How does work distribution scale? Etc. Jim, I'd like to hear your take on that: did you include this as a feature (to be used with due caution), or are you less apprehensive about its use than I was taught to be?

"Gender-neutral pronoun..."
Actually, "he" is understood to be gender-neutral, and "she" is politically correct in overdrive (for the sake of future linguists studying old archives).

""I still see no inherent required concurrency."
Bottom line is: I cannot express the algorithms using only the TBB provided structures; not because I do not want to, because I can't. They are written in a data parallel way and use distributed containers, and rewriting them is not desirable. I have to do the best I can while at the same time, use task parallelism - and then when speedups are there, I can convince users to change their algorithm. And I do not see why the answer to my question "Is this possible?" has essentially to be "Your question is wrong"."
Are you saying that you have to deal with legacy code written for something like a grid that cannot be economically adapted to nimbler SMP? I can understand that as a real-world external constraint (although it would have been more productive to reveal this earlier in the discussion), and if it were prevalent and important enough I wouldn't be surprised if TBB would add support for it for that reason alone. But I'd be interested to watch a race between the two approaches (how did "race" become a bad thing?).
0 Kudos
Alexey-Kukanov
Employee
339 Views
Quoting ipapadop
"With Intel's compiler you can have OpenMP and TBB automagically coordinate thread use, I hear."
Good - at least that's a good lead that I didn't know.

Not completely automagically. There is a special component for this, and OpenMP usage should follow some rules for the coordination to have effect. Write me an email if you want to learn more.

0 Kudos
jimdempseyatthecove
Honored Contributor III
339 Views
>>Jim, I'd like to hear your take on that: did you include this as a feature (to be used with due caution), or are you less apprehensive about its use than I was taught to be?

In a parallel task paradigm, such as TBB and my QuickThread (singular, not QuickThreads plural), it is relatively easy for the task scheduler to know when threads become idle (this is part of its job). What is difficult though (without program hints) is knowing how much compute time is remaining in any given task. None of the threading toolkits offers this capability, but someone could add it to their application (progression indication). When it is known that you have n idle threads, you could potentially schedule n+1 threads for the parallel_for or other parallel construct, assuming your threading toolkit has this capability. OpenMP does, as does QuickThread. QuickThread extended this capability by introducing filters to the "count the threads in set xxx" (e.g., cache level or socket). In addition to counting, the scheduling can optionally restrict the threads to the cache level or socket of interest. This does not guarantee that what you see available at the time of the scheduling decision will be immediately available after you make the decision. Nor does it guarantee that, after you make the scheduling selection, an additional thread in that set becomes free.
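A sketch of what application-level progression indication could look like (illustrative names; not part of TBB, OpenMP, or QuickThread):

#include "tbb/atomic.h"

// Application-maintained progress hint: each task bumps the counter
// as it finishes a unit of work, so a scheduling heuristic can
// estimate how much compute time remains in the running construct.
tbb::atomic<long> unitsDone;
long unitsTotal = 0; // set this before the parallel region starts

void finishedOneUnit() { ++unitsDone; }

double fractionRemaining() {
    return 1.0 - double(unitsDone) / double(unitsTotal);
}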

QuickThread does have a feature whereby, when you process a parallel_list, the skip list automatically reconfigures as new threads become available during execution of the list.

Also, since QuickThread uses (for example)

parallel_for(OptionalSelector$, functor, iBegin, iEnd, argsHere);

to enqueue

void functor(intptr_t iBegin, intptr_t iEnd, arg_t arg, ...)
{
    for (intptr_t i = iBegin; i < iEnd; ++i)
    {
        ...
        doSomething();
        ...
    }
}

Of particular interest here is that, unlike TBB, the iteration space is visible to the function of the functor.
What this means is that, after scheduling, during execution of the task loop, you can observe whether additional threads become available. If they do, and conditions are acceptable, you can split the work.

Example of a conditional 2-way fork

void functor(intptr_t iBegin, intptr_t iEnd, arg_t arg, ...)
{
    for (intptr_t i = iBegin; i < iEnd; ++i)
    {
        if ((iEnd - i > threshold) && qt_CountOfWaitingThreads())
        {
            // split the *remaining* range so finished iterations
            // are not re-processed
            intptr_t half = (iEnd - i) / 2;
            parallel_invoke(
                [&]() { functor(i, i + half, arg, ...); },
                [&]() { functor(i + half, iEnd, arg, ...); });
            return;
        }
        ...
        doSomething();
        ...
    }
}

Or insert an n-way fork if you prefer.

If you get tired of doing that, then write a template that does it for you, e.g.:
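A sketch of what such a template might look like (splitting_for is an illustrative name; qt_CountOfWaitingThreads() and parallel_invoke as used above):

#include <stdint.h>

// Illustrative template wrapping the conditional 2-way fork;
// Body is any callable that processes a single index.
template<typename Body>
void splitting_for(intptr_t iBegin, intptr_t iEnd,
                   intptr_t threshold, Body body)
{
    for (intptr_t i = iBegin; i < iEnd; ++i)
    {
        if ((iEnd - i > threshold) && qt_CountOfWaitingThreads())
        {
            intptr_t half = i + (iEnd - i) / 2; // split remaining range
            parallel_invoke(
                [&]() { splitting_for(i, half, threshold, body); },
                [&]() { splitting_for(half, iEnd, threshold, body); });
            return;
        }
        body(i); // no idle thread (or range too small): stay serial
    }
}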

There are different ways to qualify which threads are waiting.

Jim Dempsey
0 Kudos
RafSchietekat
Valued Contributor III
339 Views
But what circumstances, if any (other than legacy requirements), would entice you as a programmer to use this feature (even the one sentence you quoted from that paragraph was really a question of "why", not "what")? How far away in the NUMA hierarchy would you allow a thread to come and collaborate potentially low in the dependency graph (it would seem wasteful to invite a thread that does not share the same cache and should pick another high-level task instead), and wouldn't NUMA awareness be a precondition for doing that? I don't see such a restriction in the use of qt_CountOfWaitingThreads() in the example, though.
0 Kudos
jimdempseyatthecove
Honored Contributor III
339 Views
Raf,

To check for a waiting thread on the same socket, and schedule to the same socket, modify the prior code's test to use

if ((half > threshold) && get_qtControl()->CountOfWaitingThreads(L3$))

to get the count of waiting threads within L3 distance of the current thread.

Then

parallel_invoke(
L3$, // restrict to within L3
[&](){ functor(iBegin, iBegin+half, arg1, ...); }
...


If you have n sockets per NUMA node (n > 1) and more than one NUMA node, then consider replacing L3$ above with M0$ (within 0 NUMA hops from the current CPU), assuming that is what you want.

QED

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
339 Views
That's a lot of control, very nice! Now, would you agree that this selectivity is essential for performance? It seems a bit strange to have to insist (if you'll forgive me for that), because isn't that why you built in this level of controllability? If it stands to reason that stealing low-level tasks is bad when the NUMA distance is considerable, why would one willingly invite such a thief instead of one sharing an already warmed-up cache (if that is the appropriate term)? And, in the absence of external requirements for mandatory concurrency or a way to select threads, wouldn't it then be better not to ask for a particular level of parallelism than to let any thread join in?
0 Kudos
jimdempseyatthecove
Honored Contributor III
339 Views
>>would you agree that this selectivity is essential for performance?

Sometimes yes, sometimes no. You have the ability to select when it makes a difference.

An additional thing to consider is:

a) parallel_for(...);
b) parallel_for(L3$, ...);
c) parallel_for(Waiting$+L3$, ...);
d) parallel_for(ExcludeMyself$+L3$,...);


a) "standard" parallel for - all threads
b) all threads within socket of current threadregardless of busy or not.
c)current threadplus any waitingthread in current socket
d) all threads incurrent socket except current thread

In QuickThread you can choose whether the current thread participates or not, and/or blocks or not.
The above sketches show some of the flexibility of selecting the degree of "hot in cache".
(locality filters: L0$, L1$, L2$, L3$, M0$, M1$, M2$, M3$)

Now what about "cold in cache"? E.g., you know the data is not in cache and you would like to push the work off to a socket that has the most available threads:

parallel_for(NotInCache$+L3$, ...);

>>And, in the absence of external requirements for mandatory concurrency or a way to select threads, wouldn't it then be better not to ask for a particular level of parallelism than to let any thread join in?

In those situations, don't add a qualifier.

There are several other things you can do quite easily using QuickThread that are difficult (or not possible) to do with other threading paradigms. Consider creating a major parallel task, scheduling it, then indicating that, when the parallel task completes, you wish to do something else, i.e. a task completion routine. For controllable tasks in QuickThread you declare and use a task control structure:

qtControl yourTaskControl;

this can be static or scoped. Then

// parallel_for in the current socket, without the current thread
// and without blocking
parallel_for(
    ExcludeMyself$+L3$,
    &yourTaskControl,
    aFunctor, iBegin, iEnd, args here...);
// immediate return from parallel_for
// Now enqueue a completion routine
// to the above task control such that,
// when the above parallel_for completes,
// your completion routine runs
parallel_task( OnDone$,
    &yourTaskControl,
    yourCompletionRoutine,
    itsArgsHere);
// main thread still running here
...
// if you want, then some time later
yourTaskControl.waitTillDone();

(scoped task controls have an implicit waitTillDone() in the dtor of qtControl)
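I.e., a scoped use looks like this (a sketch using the same calls as above):

void bar()
{
    qtControl control; // scoped task control
    // non-blocking parallel_for under this control
    parallel_for(
        ExcludeMyself$+L3$,
        &control,
        aFunctor, iBegin, iEnd);
    // ... current thread keeps working here ...
} // dtor of control performs the implicit waitTillDone()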

Also, it is not unusual for an application to have tasks that perform I/O. QuickThread has two classes of threads: compute and I/O.

Compute-class threads are the default. The I/O class is indicated by including the qualifier IO$, or, in the case of an I/O completion routine for a compute routine, IOOnDone$.

In a parallel paradigm like OpenMP and TBB, all threads are equal. Should you schedule a task that blocks for I/O, it removes a thread from your compute thread pool. In OpenMP or TBB, to correct for this you may be inclined to add a thread (or two). Doing this produces oversubscription when not waiting for I/O; not including additional threads produces undersubscription when a thread is waiting for I/O. In QuickThread you have none of this, because you have two classes of threads (assuming you take advantage of this and program accordingly).
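For contrast, the add-a-thread workaround on the TBB side is just a sketch like this (it trades oversubscription while computing against undersubscription while blocked):

#include "tbb/task_scheduler_init.h"

// Ask TBB for one extra worker to cover a thread blocked in I/O.
// While nothing is blocked, the machine is oversubscribed by one.
tbb::task_scheduler_init init(
    tbb::task_scheduler_init::default_num_threads() + 1);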

Jim Dempsey




0 Kudos
RafSchietekat
Valued Contributor III
339 Views
I'm getting the impression (please correct me if I'm mistaken) that this is not about my original question, about a low-level task actively grabbing threads that could otherwise have chosen to steal a high-level task.
0 Kudos
jimdempseyatthecove
Honored Contributor III
339 Views
Task level/priority is a bit different in QuickThread than it is in TBB, OpenMP, Cilk, ...

Each thread has its own queue.
Each thread knows the relationships as to which threads share various cache levels and NUMA nodes.

(n producers, n consumers)

parallel_invoke-dispatched tasks have the highest priority within the restrictions placed upon them. There are 4 levels per thread; these tend to get processed LIFO, but some enqueuing sequences can reorder one or more prior parallel_invoke-enqueued tasks (to that thread).

parallel_...(other) has two types of queues. The first is an n-slot queue (currently 4 slots), similar to the parallel_invoke queue but of lesser priority and with a higher degree of filtering restrictions.

The second type of parallel_...(other) queue is a FIFO linked list; this has the lowest priority.
You can also enqueue task completion nodes in yet another FIFO linked list (one list per task control).

The parallel_...(other) can enqueue either to the (predominantly) FIFO n-slot list, which has higher priority than the unlimited FIFO list, or directly to the lowest-priority unlimited FIFO list.

If a deadlock is detected, then the affected queues are reordered.
Under an exceptionally unusual circumstance (low memory), there is yet another queue.

***

Task stealing has optional filtering capability:

no restrictions as to who steals the task
restricted to within a specified cache or NUMA level distance from the owning thread
restricted to specific thread or pre-defined list of threads (specified at enqueue time)

Due to filtering, dequeue order can vary from that of an unfiltered dequeue.

Also, task stealing examines queues in priority order by closeness of cache/NUMA level.
IOW a thread performs:

check its own queue
check queues of threads sharing L1 (e.g. HT sibling)
check queues of threads sharing L2
... L3
... same NUMA node
... one NUMA hop
... two NUMA hops
... three NUMA hops

(above with filtering restrictions)

Task stealing, when it occurs (and it does occur), steals not only by way of FIFO/LIFO but subsequently by cache-level proximity (closest chosen first). Although this sounds like a lot of work, I spent a lot of time getting this optimized.

This provides a rather large (complex) range of enqueuing/dequeuing flexibility. If you do not need it, do not use these features. When you do need it, it is there.

BTW, my parallel_pipeline takes advantage of several of these features; it can really fly.

Jim Dempsey
0 Kudos