Emulating an OpenMP parallel for with TBB

ipapadop · ‎04-13-2011

I was wondering if it is possible to have a similar construct like an OpenMP parallel for with TBB.

For example, in OpenMP I can do the following:

void foo(void) {
int i;
#pragma omp parallel for private(i)
for (int i=0; i<10000; ++i) {
int tid = omp_get_thread_num(); // current thread id
int nth = omp_get_num_threads(); // current number of threads
// do sth - even if there are dependencies
}
}

I am aware of the existence of the parrallel_* algorithms, but they make no guarantee that the tasks generated will execute in parallel (and it is explicitly mentioned that dependencies between them can cause a deadlock).

The question is if I am able to express in TBB a fork-join model and have the guarantee that it will be forked as much as possible as the resources allow it.

I believe it should be possible if I would have access to the information on how many threads are already executing something.

RafSchietekat · ‎04-13-2011

The question is why would you do that except to make the program run slower than it should?

There may be a more appropriate way to translate those dependencies than simple transliteration, so to say.

ipapadop · ‎04-13-2011

I know that TBB was not made with this in mind and that they could probably be scalability issues with a data parallel approach but I still would like to experiment with fork-join in TBB.

I tried to implement it with a task_list and have a scalable fork, however I'm still lacking the information on how many threads are readily available.

RafSchietekat · ‎04-13-2011

TBB's message is that required concurrency (as opposed to optional parallelism) is bad for you(r performance), so you'll have to forgive it for not handing you the rope with which to go hang yourself. If you just want to experiment, stick with OMP or employ user threads for the required concurrency, and perhaps you'll still have some opportunity to add features from TBB to help you with the performance (hybrids are possible, but it only makes sense if all those user threads aren't already hogging the machine).

Or you could still tell us what you want to do instead of how you want to do it and let us suggest a better approach.

ipapadop · ‎04-13-2011

I understand what TBB's ideology is - and I do not disagree with what you say.All I want is to execute a workfunction SPMD-style on N threads - and I know for sure that I have N threads executing it; which equals to a fork-join.

However I do not wish to oversubscribe the system in the presence of other executing tasks (in that case, if I try to fork, it will just use the current executing thread).

I am aware of the "dangers" of the data parallel model and I do not suggest adding it to TBB; however I'm not your typical user and I would like to test a data parallel approach through TBB and merge it with task parallelism. In that case I can compare it with OpenMP that offers both (even though OpenMP tasks are severely limited). But for that I need at least a way to know for example that a parallel_for will certainly create N tasks.

Alexey-Kukanov · ‎04-13-2011

No matter whether you use parallel_for or spawn tasks, in TBB you do not know how many threads will work on it. Threads may join and leave work sharing arena at any time.

You may spawn exactly as many tasks as you need (or expect) threads, each waiting on the same barrier.When youhave no other work for TBB in the app/test, the threads will eventually arrive to the barrier. But, any comparison of this hack with OpenMP does not make sense, because OpenMP is designed and optimized for this team-and-barrier programming style (and has compiler support to do it efficiently) while TBB is not.

By the way, this team-and-barrier style has little relation to data parallelism IMHO. What you want is not parallelism but mandated concurrency.

ipapadop · ‎04-13-2011

I believe this would cause a deadlock, especially if the number of tasks is more than the number of threads available to TBB. Why do you call it mandated parallelism? I never said I want N threads, I only want to know how many threads are working on my stuff.

I am basically asking if something like a parallel for is possible in TBB.If TBB cannot do such things ever, that's OK.

IDZ_A_Intel · ‎04-13-2011

If you want to play with data parallelism then maybe you should code up the data parallel portion of your algorithmin Intel Array Building Blocks (part of the interoperable Intel Parallel Building Blocks), a package designed to exploit data parallelism. If that doesn't further your purpose, then perhaps what you really are (unintentionally) talking aboutmandated concurrency, or at least an assurance that a piece of code has been executed by some number of distinct threads.

Intel Threading Building Blocks doesn't work that way. You can think of it as a decentralized scheduler. There's no master controller that knows what all the threads are doing. There is no repository for collecting such statistics (at least in production code)because the cache thrashing required to collect it would kill performance.

You could have a kernel that keeps a tally, marking some slate as each unique thread passes through the code and then determine after the fact how many threads were engaged, but I get the feeling that's not what you want. And even that doesn't guarantee that all the threads that touched it were there simultaneously for long enough tobe called concurrent.

And even if you could get all the available threads to play concurrently, it still probably wouldn't look much like an OMP for-loop. The scheduling will be radically different. If you just use the OMP static scheduler, it will divide your loop of 10,000 into as many pieces as there are available threads (typically either all or one) and give each piece to a thread, which will execute in the specified direction until it's done with its piece, then go to the barrier. Whereas in Intel TBB, one thread will split the list into two pieces of 5,000, split one of those into two pieces of 2,500 and so on until it gets down toa threshold where it executes the piece it has left; concurrently, idle threads will look for these left behind pieces to steal and start the same splitting process upon them. The number of threads executing upon the 10,000 will then grow to some maximal concurrency as threads finish their pieces of the 10,000 and go off to look for other work. At some point that work will start to become scarce, and the available threads will go off to look for other work and eventually idle themselves. There's no even division of the work across the list and if you preturb the threads enough to observe them, it just looks like chaos.

So, not your typical user, can you be more specific on the types of data parallelism with which you want to experiment?

ipapadop · ‎04-13-2011

I do not wish to use Array Building Blocks.

I am developing a framework that allows the user to run SPMD functions. The user only says:

spmd_execute(Nthreads, f, args...);

and the framework scalably invokes Nthreads-times the function f() with the given arguments.The user does not necessarily have to specify Nthreads.

In OpenMP, a rough abstract implementation of spmd_execute() is:

spmd_execute(Nthreads, f, args...)
{
if (omp_get_level()==0) {
int max_nth = min(omp_get_max_threads(), Nthreads);
#pragma omp parallel for nowait schedule(static, 1)
for (int i=0; i
f(args...);
}
}
else {
f(args);
}
}
f() might do computations, have OpenMP calls, create nested OpenMP parallel regions and/or call spmd_execute() recursively.Becausef() is executed SPMD, I can know how many threads execute it and what is each thread id. This is crucial for my framework, as well as supporting nested parallelism without oversubscribing. However, specifying exactly Nthreads to spmd_execute() is not a big deal.
I am trying to do the same with TBB (in which case I can have my SPMD function f() call TBB inside). I do not want to use threads, as it would oversubscribe the system in case f() is calling TBB.

RafSchietekat · ‎04-13-2011

But what are those dependencies if you don't obey the user's Nthreads argument in the first place? You're not telling us the ultimate goal, and even the mandated how is getting blurry now...

ipapadop · ‎04-13-2011

The user can if she wants to, to specify how many threads are going to execute f(). If she doesn't, then the framework picks a suitable number of threads. Which means:

1) if no other parallel computation exists, fork as much as possible

2) if another parallel computation exists (ie some threads are currently working) use the rest to fork

3) if it is a nested spmd_execute() call, then run on one thread - the one that called you

I do not know the dependencies in f() - consider this user-code. All I can provide is number of threads and current thread id. It is an arbitrary use case, I have no released software or users, I just want to see if it can get done with TBB (as I showed, it is doable with OpenMP).

Alexey-Kukanov · ‎04-14-2011

So sounds it is about using a parallel framework as a mere thread pool manager that keeps control of overall amount of active execution contexts and so tries to prevent over- and undersubcription.

Yes you can do that with OpenMP and cannot do that with TBB, which motto is "think in terms of work (i.e. parallelism in your problem) andnot in terms of workers (threads)". TBB helps experssing the parallelism, in the form of tasks or generic parallel algorithms. The notion of threads or a parallel region is not exposed, so you really cannot do what you want without dangerous hacks.

RafSchietekat · ‎04-14-2011

I think you and TBB will just have to agree to disagree and go your separate ways.

ipapadop · ‎04-14-2011

More or less yes - again, what I gave is an oversimplified version.

In this case when I do task_scheduler_init init(p) and I know that nobody else uses TBB, do I have any guarantees that if I call parallel_for(0, p, f), f() will be executed on exactly p threads?

And if I do spawn_root_and_wait() from a task, when the spawned task is done, will TBB return me to the original or it might try to execute other pending tasks?

IDZ_A_Intel · ‎04-14-2011

Will you get p threads if you're the only IntelTBB player and requested a pool of p threads? Maybe. No guarantees, as we said before. Consider my previous parallel_for example. If we're talking 10,000 items and say, 16 threads, it's likely that all 16 will be employed, but not guaranteed. If there's only 100 items, it's less likely that all 16 threads will be employed. A case I have in mind is where all the free work gets stolen before all the threads in the pool become engaged. You might have 15 threads engage in the work while the 16th vainly rolls a die to find work but never does. The one thing you can guarantee here is that it won't be more than p threads.

I'd try to answer your other question, but I'm not sure I understand it. "return me to the original?" task? If there are other tasks that have been scheduled, the threads in the TBB pool will seek them out. You don't own any threads in TBB, only tasks.

jimdempseyatthecove · ‎04-17-2011

ipapadop,

In your second example (smpd_execute), consider for the sake of argument you enter spmd_execute with a thread at level 0 (main OpenMP thread of process). This establishes (or enlists) the max_nth thread pool to perform the work (as n-way fork). Immediately after this part all the threads will be busy chewing away at function f(args).

Assume also that the work to be performed in f(args) and/or the latency through f(args) is un-equal.Note f(args) could perform I/O, have page faults, or get pre-empted by other process on system. What happens in this situation is some threads complete earlier than others. No problem here (supposedly) because you have nowait on the #pragma omp.... Now let's see what can supposedly happen...

Under this circumstance, when the threads that are not the main thread complete... those threads go idel (may allso burn time in KMP_BLOCK_TIME). Should the main thread complete before the other threads complete, the main thread exits spmd_execute and presumably runs into code the calls spmd_execute again (with same or different f(args)). On this second call it will detect it calls from level 0 and will assume all threads are available. It is quite possible that some of the threads from the first call are still busy, but your code is executing as if theyare available.

In the former case you have idle threads, in the latter case you have a quasi over-subscription of threads on calls from outer most level and under-subscription of potentially soon to become available threads. So, excepting for an application thathas only one nest (recursion) level, your smpd_execute would likely exhibit ineffective use of available cores.

Jim Dempsey

RafSchietekat · ‎04-18-2011

It still would be nice to know for certain what this is all about for this discussion to be meaningful: a problem that could be refactored to shine on top of TBB, or an essentially different paradigm with requirements like, e.g., actor-based concurrency's. With performance and fairness typically (and perhaps essentially) in opposition, and TBB deliberately choosing the side of performance, an essential trade-off must be made, which is hard to do without knowing the ultimate goal. And even if everything is clearly understood and doable, there is the cost-benefit question between making the necessary changes (hopefully without adverse effects elsewhere or burdensome maintenance efforts afterwards) and being able to use them in other areas as well. Hence my previous response.

ipapadop · ‎04-18-2011

@Robert Reed: I am basically asking in the case that I call spawn_root_and_wait() from a task T1, what will the scheduling policy be? Will TBB execute the spawned tasks from T1 and return me to T1 ASAP or will it execute other pending tasks as well? In the latter case, will it work-steal as well?

ipapadop · ‎04-18-2011

@Jim Dempsey:Any decent runtime should be able to detect how many worker threads are available and divide the work accordingly right?

At least the OpenMP runtimes I have tried do that - if you specify the correct policy, they will only fork as much as possible, without oversubscribing. And remember, I am building a framework, so if the user does not specify how many threads she wants, I'm deciding how many I am going to use.

@Raf Schietekat: I am playing with distributed containers if that helps you. I am partitioning my container in n blocks and there are many different partitioning schemes. I can then use the spmd_execute() to apply algorithms on the distributed container. My framework decides how many threads to spawn for executing the algorithm, based on locality, number of available threads etc (yes, I do use knowledge regarding locality so it is not an issue).

I could allocate my container using the scalable allocator and let TBB do its magic as regards to affinity, but most of the times this is not possible for various reasons.

In OpenMP this thing works but I cannot interface correctly with TBB since 1) it does not allow me to say "i want n threads to execute this thing" and 2) it does not take into account how many threads are currently busy doing OpenMP stuff.I am also coming to the conclusion that because of the lack of this support (which is a design decision which I am not debating) that I cannot have it my way.

RafSchietekat · ‎04-18-2011

"I am basically asking in the case that I call spawn_root_and_wait() from a task T1, what will the scheduling policy be? Will TBB execute the spawned tasks from T1 and return me to T1 ASAP or will it execute other pending tasks as well? In the latter case, will it work-steal as well?"
See Reference Manual/Task Scheduler/Scheduling Algorithm: returning from spawn_root_and_wait() has the highest priority unless you explicitly override it, but there is no telling how long a thread may be otherwise occupied once it has stolen a task.

"Any decent runtime should be able to detect how many worker threads are available and divide the work accordingly right?"
Wrong. TBB deals more efficiently with the opportunity cost of parallel execution.

"At least the OpenMP runtimes I have tried do that - if you specify the correct policy, they will only fork as much as possible, without oversubscribing. And remember, I am building a framework, so if the user does not specify how many threads she wants, I'm deciding how many I am going to use."
I do not see why female users would be more willing to sacrifice performance for even division of work... on a machine. :-)

"I am playing with distributed containers if that helps you. I am partitioning my container in n blocks and there are many different partitioning schemes. I can then use the spmd_execute() to apply algorithms on the distributed container. My framework decides how many threads to spawn for executing the algorithm, based on locality, number of available threads etc (yes, I do use knowledge regarding locality so it is not an issue)."
I still see no inherent required concurrency.

"I could allocate my container using the scalable allocator and let TBB do its magic as regards to affinity, but most of the times this is not possible for various reasons."
That sounds mysterious enough for me to withhold comment.

"In OpenMP this thing works but I cannot interface correctly with TBB since 1) it does not allow me to say "i want n threads to execute this thing" and 2) it does not take into account how many threads are currently busy doing OpenMP stuff."
With Intel's compiler you can have OpenMP and TBB automagically coordinate thread use, I hear.

"I am also coming to the conclusion that because of the lack of this support (which is a design decision which I am not debating) that I cannot have it my way."
I refer back to my rope analogy (less is more), and wish you good luck.

ipapadop · ‎04-18-2011

"Wrong. TBB deals more efficiently with the opportunity cost of parallel execution."

That's completely orthogonal.

"I do not see why female users would be more willing to sacrifice performance for even division of work... on a machine. :-)"

Gender-neutral pronoun...

"I still see no inherent required concurrency."

Bottom line is: I cannot express the algorithms using only the TBB provided structures; not because I do not want to, because I can't. They are written in a data parallel way and use distributed containers, and rewriting them is not desirable. I have to do the best I can while at the same time, use task parallelism - and then when speedups are there, I can convince users to change their algorithm. And I do not see why the answer to my question "Is this possible?" has essentially to be "Your question is wrong".

"With Intel's compiler you can have OpenMP and TBB automagically coordinate thread use, I hear."

Good - at least that's a good lead that I didn't know.