Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Priority of items within a parallel_for

Jonathan_J_
Beginner
485 Views

I have a parallel_for loop executing various tasks in parallel.  It is known ahead of time that certain tasks among those tasks will take a little longer than other tasks and require more computation.  Is it possible to assign higher priority to certain tasks in a parallel for loop?

Thanks!

If it helps, here is some (pseudo)code for my parallel_for loop below:

[cpp]

class RunParallelSims{
public:
     Sim *sims;

    void operator()(const tbb::blocked_range<size_t>& r) const {
        Simulation *sims_ = sims;

        for(size_t i = r.begin(); i != r.end(); ++i) {
            sims_->run(); //here is where i would like to be able to examine sims_ and set a priority on it, if anything like that is possible
        }
    }  

     RunParallelSims(Simulation_c *_sims) :
     sims(_sims),
    {}

class SimulationRunner {

    int main(int argc, char *argv[]) {

        //...various code omitted here for simplicity

        Sims arrayOfSimulations[]; //an array of simulations, where some simulations are known to have higher computation time

        tbb::parallel_for(tbb::blocked_range<size_t>(0, numSims,1), RunParallelSims(arrayOfSimulations));

    }

}

[/cpp]

0 Kudos
8 Replies
RafSchietekat
Valued Contributor III
485 Views

There's no preferential treatment to be had from parallel_for(), but maybe you could try parallel_do() with a concurrent_priority_queue?

0 Kudos
jimdempseyatthecove
Honored Contributor III
485 Views

Another option is to do something completely different.

In your Simulation, where you have run(), add runLow() and runHigh(). runLow() { if(isLowPriority()) run(); }, and runHigh() { if(!isLowPriority()) run();}

Then use parallel_invoke forking two secions of parallel_for, one issing runLow, the other issuing runHigh.

While this isn't true prioritization,c it may yield a satisfactory effect.

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
485 Views

Wouldn't it actually be counterproductive to put all the long tasks in one side of a parallel_invoke?

To prevent any misunderstanding: I'm assuming that the longer tasks cannot benefit from recursive parallelism (where you subdivide them to create parallel slack), and that they are too long to ignore at the end of the computation (better to first fill the glass with pebbles and only then with sand). In that case the ordinarily somewhat less efficient parallel_do() comes into play, but I don't see why I would have mentioned concurrent_priority_queue, because a simple vector should suffice here (order from big to small, cook until done).

0 Kudos
jimdempseyatthecove
Honored Contributor III
484 Views

>>Wouldn't it actually be counterproductive to put all the long tasks in one side of a parallel_invoke?

Each side of the parallel_invoke is performing a parallel_for. Presumptively on my part, each side will begin running with at least one thread prior to the other side finishing (when system has more than one thread). Once both sides start running, then at least one thread from each side will process partitions of the parallel_for of that side. The remainder of partitions of each side are up for grabs from the rest of the threads of the thread pool (assuming there are more than 2 threads). When one of the sides finishes its thread is available to continue stealing parallel_for slices from the other side.

As to if this technique satisfies the O.P. it is up to him to decide. The above technique is but one way to balance loads to a small degree. Your split point is variable too. You might want to set the split point where the total work on the small side exceeds the total work on the large side.

Not enough is known about the problem to offer a best solution.

a) are new tasks generated while processing starting set of tasks?
b) how much computation is there in a large task as compared to mutex, push_back, ...?
c) how much computation is there in a small task as compared to mutex, push_back, ...?
d) how many threads are available?
e) do any of the tasks stall (e.g. disk I/O)
f) ...?

An alternate aproach

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
484 Views

If the idea is to get the coarser tasks out of the way before starting with the finer tasks, the parallel_invoke() proposal would prevent a mixture of coarse and fine tasks in the ready pool of any one thread, so no coarse task would be waiting behind any number of fine tasks within any one thread's ready pool, but only a fraction of the workers would be working on the coarse tasks while others would be specialising in fine tasks, and in that sense the goal is not fulfilled. I should probably not have written "counterproductive", because I don't see how this compares with a random distribution, but it does seem to be suboptimal.

For my own suggestion, perhaps there should be some grouping of tasks to reduce overhead in the parallel_do(), each group to be executed with parallel_for(), but with at least a few times as many groups per priority level as there are physical threads and perhaps the coarsest tasks in singleton groups, to prevent a significant amount of work being undertaken on a lower priority before the higher priority has finished executing. But that's a big complication, so the potential benefit would have to be worth it. Let's first see what a straightforward parallel_do() of individual tasks does, so at this point I'd like to hear back from Jonathan.

Note that we'e using "task" in a general sense here, which should probably be avoided, but I couldn't think of another single word.

0 Kudos
Jonathan_J_
Beginner
484 Views

Sorry for the long delay, we were actually pursuing some alternatives to TBB, just as kind of a side route.  To answer Jim's questions above:

a) are new tasks generated while processing starting set of tasks?
Nope 

b) how much computation is there in a large task as compared to mutex, push_back, ...?
There are no mutex's; each task is completely independent of each other 

c) how much computation is there in a small task as compared to mutex, push_back, ...?
There are no mutex's; each task is completely independent of each other

d) how many threads are available?
As many as we want... at the moment there are 6 cores; we are probably going to avoid hyperthreading as there is no task stall, so each core will be working on a task continuously, disabling any benefit from hyperthreading

e) do any of the tasks stall (e.g. disk I/O)
No

To give you more of an idea of the task, we are essentially processing a large number of simulations in tandem.  We want the simulations to proceed step-by-step in regards to the frame number. I.e. Each simulation should run frame 1 together, then once all have completed frame 1, each one proceeds to frame 2, etc etc.  However, each simulation has a (relatively) variable amount of work to do in a frame.

It might be a little while before I can get you any numbers on parallel_do, I'm getting side tracked onto another task but I will get results from that as soon as possible. 

0 Kudos
jimdempseyatthecove
Honored Contributor III
484 Views

Jonathan,

The point of asking b) and c) is not to query as to if you use a mutex, rather it was for the purpose of should a mutex be introduced how would the additional overhead relate to the code execution time. IOW if run times are quite long, then +mutex time is immaterial.

Your answer to d) and your descriptive paragraphs (first post, last post) conflict.

The jist that I gather is:

a) you want frame-by-frame processing
b) a frame can be decomposed into several (many) tasks
c) these task have varying complexity (different run times)
d) the number of tasks is not related to the number of cores/HW threads
e) you want to maximize the utilization of the cores/HW threads

Assume you know you have 6 hw threads:

volatile intptr_t i = 0;
parallel_invoke(
[&](){ for(intptr_t my_i = sync_fetch_and_add(&i); my_i < N; my_i = sync_fetch_and_add(&i)) doWork(my_i); },
[&](){ for(intptr_t my_i = sync_fetch_and_add(&i); my_i < N; my_i = sync_fetch_and_add(&i)) doWork(my_i); },
[&](){ for(intptr_t my_i = sync_fetch_and_add(&i); my_i < N; my_i = sync_fetch_and_add(&i)) doWork(my_i); },
[&](){ for(intptr_t my_i = sync_fetch_and_add(&i); my_i < N; my_i = sync_fetch_and_add(&i)) doWork(my_i); },
[&](){ for(intptr_t my_i = sync_fetch_and_add(&i); my_i < N; my_i = sync_fetch_and_add(&i)) doWork(my_i); },
[&](){ for(intptr_t my_i = sync_fetch_and_add(&i); my_i < N; my_i = sync_fetch_and_add(&i)) doWork(my_i); });

This adds the overhead of an XCHGADD, but this is all.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
485 Views

N.B.

Sort the N tasks by runtime High to Low.
Replace "doWork(my_i)" with "YourOrderedQueue[my_i]->run();" or whatever your pleasure.

This way, large tasks run first, smaller tasks fill in the blanks.

In lieu of parallel_invoke, you could loop launch number of threads used in thead pool. parallel_invoke is faster from my past experience.

If multiple frames can be worked on independently (concurrently), then consider using parallel_pipeline on the frames.

Jim Dempsey

0 Kudos
Reply