void foo(void) {
    #pragma omp parallel for
    for (int i = 0; i < 10000; ++i) {
        int tid = omp_get_thread_num();   // current thread id
        int nth = omp_get_num_threads();  // current number of threads
        // do sth - even if there are dependencies
    }
}
There may be a more appropriate way to translate those dependencies than simple transliteration, so to speak.
Or you could still tell us what you want to do instead of how you want to do it and let us suggest a better approach.
No matter whether you use parallel_for or spawn tasks, in TBB you do not know how many threads will work on it. Threads may join and leave the work-sharing arena at any time.
You may spawn exactly as many tasks as you need (or expect) threads, each waiting on the same barrier. When you have no other work for TBB in the app/test, the threads will eventually arrive at the barrier. But any comparison of this hack with OpenMP does not make sense, because OpenMP is designed and optimized for this team-and-barrier programming style (and has compiler support to do it efficiently) while TBB is not.
By the way, this team-and-barrier style has little relation to data parallelism IMHO. What you want is not parallelism but mandated concurrency.
If you want to play with data parallelism then maybe you should code up the data-parallel portion of your algorithm in Intel Array Building Blocks (part of the interoperable Intel Parallel Building Blocks), a package designed to exploit data parallelism. If that doesn't further your purpose, then perhaps what you really are (unintentionally) talking about is mandated concurrency, or at least an assurance that a piece of code has been executed by some number of distinct threads.
Intel Threading Building Blocks doesn't work that way. You can think of it as a decentralized scheduler. There's no master controller that knows what all the threads are doing. There is no repository for collecting such statistics (at least in production code) because the cache thrashing required to collect it would kill performance.
You could have a kernel that keeps a tally, marking some slate as each unique thread passes through the code, and then determine after the fact how many threads were engaged, but I get the feeling that's not what you want. And even that doesn't guarantee that all the threads that touched it were there simultaneously for long enough to be called concurrent.
And even if you could get all the available threads to play concurrently, it still probably wouldn't look much like an OMP for-loop. The scheduling will be radically different. If you just use the OMP static scheduler, it will divide your loop of 10,000 into as many pieces as there are available threads (typically either all or one) and give each piece to a thread, which will execute in the specified direction until it's done with its piece, then go to the barrier. Whereas in Intel TBB, one thread will split the list into two pieces of 5,000, split one of those into two pieces of 2,500 and so on until it gets down to a threshold where it executes the piece it has left; concurrently, idle threads will look for these left-behind pieces to steal and start the same splitting process upon them. The number of threads executing upon the 10,000 will then grow to some maximal concurrency as threads finish their pieces of the 10,000 and go off to look for other work. At some point that work will start to become scarce, and the available threads will go off to look for other work and eventually idle themselves. There's no even division of the work across the list, and if you perturb the threads enough to observe them, it just looks like chaos.
So, not your typical user, can you be more specific on the types of data parallelism with which you want to experiment?
spmd_execute(Nthreads, f, args...) {
    if (omp_get_level() == 0) {
        int max_nth = min(omp_get_max_threads(), Nthreads);
        #pragma omp parallel for nowait schedule(static, 1)
        for (int i = 0; i < max_nth; ++i)
            f(args...);
    }
    else {
        f(args...);
    }
}

f() might do computations, have OpenMP calls, create nested OpenMP parallel regions and/or call spmd_execute() recursively. Because f() is executed SPMD, I can know how many threads execute it and what each thread id is. This is crucial for my framework, as is supporting nested parallelism without oversubscribing. However, specifying exactly Nthreads to spmd_execute() is not a big deal. I am trying to do the same with TBB (in which case I can have my SPMD function f() call TBB inside). I do not want to use raw threads, as that would oversubscribe the system in case f() is calling TBB.
Yes, you can do that with OpenMP and cannot do that with TBB, whose motto is "think in terms of work (i.e. parallelism in your problem) and not in terms of workers (threads)". TBB helps express the parallelism, in the form of tasks or generic parallel algorithms. The notion of threads or a parallel region is not exposed, so you really cannot do what you want without dangerous hacks.
In this case when I do task_scheduler_init init(p) and I know that nobody else uses TBB, do I have any guarantees that if I call parallel_for(0, p, f), f() will be executed on exactly p threads?
And if I do spawn_root_and_wait() from a task, when the spawned task is done, will TBB return me to the original, or might it try to execute other pending tasks?
I'd try to answer your other question, but I'm not sure I understand it. "return me to the original?" task? If there are other tasks that have been scheduled, the threads in the TBB pool will seek them out. You don't own any threads in TBB, only tasks.
In your second example (spmd_execute), consider for the sake of argument that you enter spmd_execute with a thread at level 0 (the main OpenMP thread of the process). This establishes (or enlists) the max_nth thread pool to perform the work (as an n-way fork). Immediately after this point all the threads will be busy chewing away at function f(args).
Assume also that the work to be performed in f(args) and/or the latency through f(args) is unequal. Note f(args) could perform I/O, have page faults, or get pre-empted by another process on the system. What happens in this situation is that some threads complete earlier than others. No problem here (supposedly), because you have nowait on the #pragma omp.... Now let's see what can supposedly happen...
Under this circumstance, when the threads that are not the main thread complete, those threads go idle (and may also burn time per KMP_BLOCKTIME). Should the main thread complete before the other threads, the main thread exits spmd_execute and presumably runs into code that calls spmd_execute again (with the same or a different f(args)). On this second call it will detect it is called from level 0 and will assume all threads are available. It is quite possible that some of the threads from the first call are still busy, but your code is executing as if they were available.
In the former case you have idle threads; in the latter case you have quasi over-subscription of threads on calls from the outermost level and under-subscription of potentially soon-to-become-available threads. So, except for an application that has only one nesting (recursion) level, your spmd_execute would likely exhibit ineffective use of available cores.
Jim Dempsey
@Raf Schietekat: I am playing with distributed containers if that helps you. I am partitioning my container in n blocks and there are many different partitioning schemes. I can then use the spmd_execute() to apply algorithms on the distributed container. My framework decides how many threads to spawn for executing the algorithm, based on locality, number of available threads etc (yes, I do use knowledge regarding locality so it is not an issue).
In OpenMP this thing works, but I cannot interface correctly with TBB since 1) it does not allow me to say "I want n threads to execute this thing" and 2) it does not take into account how many threads are currently busy doing OpenMP stuff. I am also coming to the conclusion that, because of the lack of this support (which is a design decision which I am not debating), I cannot have it my way.
See Reference Manual/Task Scheduler/Scheduling Algorithm: returning from spawn_root_and_wait() has the highest priority unless you explicitly override it, but there is no telling how long a thread may be otherwise occupied once it has stolen a task.
"Any decent runtime should be able to detect how many worker threads are available and divide the work accordingly right?"
Wrong. TBB deals more efficiently with the opportunity cost of parallel execution.
"At least the OpenMP runtimes I have tried do that - if you specify the correct policy, they will only fork as much as possible, without oversubscribing. And remember, I am building a framework, so if the user does not specify how many threads she wants, I'm deciding how many I am going to use."
I do not see why female users would be more willing to sacrifice performance for even division of work... on a machine. :-)
"I am playing with distributed containers if that helps you. I am partitioning my container in n blocks and there are many different partitioning schemes. I can then use the spmd_execute() to apply algorithms on the distributed container. My framework decides how many threads to spawn for executing the algorithm, based on locality, number of available threads etc (yes, I do use knowledge regarding locality so it is not an issue)."
I still see no inherent required concurrency.
"I could allocate my container using the scalable allocator and let TBB do its magic as regards to affinity, but most of the times this is not possible for various reasons."
That sounds mysterious enough for me to withhold comment.
"In OpenMP this thing works but I cannot interface correctly with TBB since 1) it does not allow me to say "i want n threads to execute this thing" and 2) it does not take into account how many threads are currently busy doing OpenMP stuff."
With Intel's compiler you can have OpenMP and TBB automagically coordinate thread use, I hear.
"I am also coming to the conclusion that because of the lack of this support (which is a design decision which I am not debating) that I cannot have it my way."
I refer back to my rope analogy (less is more), and wish you good luck.
"I do not see why female users would be more willing to sacrifice performance for even division of work... on a machine. :-)"
"I still see no inherent required concurrency."
