Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.
2466 Discussions

concurrent parallel_for and task execution order

Jeffrey_B_
Beginner
809 Views
i would like to understand how parallel_for executes tasks when concurrent parallel_for calls are made from independent threads of a process.
As an example. If I have thread A calling parallel_for with items 1..10 that I would like parallel execution and thread B calling parallel_for with items 2..20 then what is the overall order. I would like 1..10 to be executed in parallel and then 2..20 executed or vice verson. Think of 1..10 and 2..20 as batches. I want the batches executed in parallel but not intermixed in any way.
If this is not possible with simple parallel_for call then is there a way to do it other than putting a mutex lock before parallel_for execution and release after it returns?
I tried to read the parallel_for and task/scheduler code but it is somewhat unreadbale because of virtual function and heavy recursion. I don't see any locks for ordering so I suspect that no ordering can be expected.
Thanks.
/JMB
0 Kudos
5 Replies
Andrey_Marochko
New Contributor III
809 Views
Two parallel_for calls made from different thread will proceed concurrently with TBB worker thread pool split between them. If you need to prevent batches from overlapping, you'll have to use manual synchronization, e.g. tbb::mutex. The latter is a wrapper over an OS primive that parks a thread in the OS kernel for the time of wait. I assume your parallel_for workloads are large enough to amortize the cost of user/kerlel mode transition (otherwise parallelization wouldn't have made sense).

Better yet, change the logic of your application to make parallel_for invocations one after another in the same thread. Or consider using tbb::pipeline with the parallel_for in question placed inside a serial filter.
0 Kudos
Jeffrey_B_
Beginner
809 Views
Thanks Andrey. You gave me confirmation on what I expected. I wish tbb had the notion of a context (not group context) but execution context that would allow demarcation lines of execution. Think of this as a pool with priority where each invocation of parallel_for was given a priority one lower than the previous invocation. The pool would then execute in priority order and not allowed to execute a lower priority item until all higher priority items were complete. However, thinking about this out load, the behvaior I describe is sort of the mutex at the front end solution.
I will probably use mutex for now and then experiment with my own thread pool to see if there is any gain that can be achieved.
0 Kudos
Alexey-Kukanov
Employee
809 Views

I have some questions to you JMB.
-How do you determine which job is more important - is it strictly by the order of start? What if two paralleljobs are started at nearly the same time, and so race for priority?
- When a master thread is unable to start the job because a previous one is not complete yet, what should this thread do ideally - just wait for its turn, delegate job submission to some other thread and continue asynchronously, or maybe help processing the currently running job?

0 Kudos
Jeffrey_B_
Beginner
809 Views
Alexey,
Let me try to explain a bit more. I have n threads processing stuff concurrently (and independent of each other, i.e. there is also no sharing between threads) and at the end each produce a list that needs to be executed against a function. The relative order of when they finish is not important and not deterministic (for example, T1 may have started processing after T2 but T1 might have finished first). What I want to do is process the list produced by each thread as a batch, i.e. all items in the batch must be completed before I can process any items from another batch.
So, it is easy to use an atomic counter or cas to produce priority order, e.g. (sorry for the the goto but easiest to explain in pseudo code)
tryagain:
p = priority;
if (priority.cas(p,p-1) == true) // let's not worry about overflow of p for now
{
invoke parallel for with priority of p - 1
}
else
{
gototryagain;
}
the reason I suggested that what I proposed is not much better than mutex is exactly because the answer to your last question is not a good one. you have to wait so there is no real difference between waiting at the parallel_for invocation on a mutex and calling parallel_for and having it wait.
thinking about this a bit more I think I have an idea for a good solution but it is very app specific. What I have is a list of stuff that I need to logically do 2 things on. One is produce a result and the second is send that result off and it is the later step that requires the strict ordering I mentioned. To achive max parallelism I want to produce the results in parallel without regard for order but I want to send the results with the ordering I described.
So, I think it is possible to execute parallel_for without mutex on the batch which will only produce result and then take the output of the parallel_for get a mutex and execute the sending of the result. this way I get parallel_for for production of results and mutex is only taken for sending of results and I can send the batch as a whole with strict ordering that i need. I believe this achieves max parallelism because it is production of result that takes the longest amount of time and thus I want as many threads executing as possible. The result sending doesn't even need parallel_for because it is fairly quick.
0 Kudos
Alexey-Kukanov
Employee
809 Views
Ok I see; thanks.
It does not at all sound application specific :) It sounds like a somewhattypical pipeline with 3 stages: a list is produced, processed, and sent to a sink. The processing takes most of time, and independent pieces can be processed simultaneously (and in your case, processing also contains internal parallelism that you exploit with parallel_for). If the requirement to produce the list in different threads is not essential, I would recommend looking into tbb::pipeline, or maybe tbb::graph preview feature. And in any case, separation of processing and output, if possible, is a right thing to do. Great that the discussion helped you to find a good solution :)
0 Kudos
Reply