timelzayus
Beginner
161 Views

break in parallel_for

Hello.
I'd like to scan a vector and stop at the first matching element.
In serial code, it would be:
int matchIndex = -1;
for (int i = 0; i < myVector.size(); ++i) {
    if (elementMatch(myVector[i])) {
        matchIndex = i;
        break;
    }
}
I don't know how to do this with a parallel_for; can someone help?
Thank you very much.
11 Replies
RafSchietekat
Black Belt

To compute a reduction (and that's what this is), think parallel_reduce() instead of parallel_for(). It is of course a very special case, which I think has come up before, so you should be able to find the answer in an earlier thread.
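For illustration, here is a minimal sketch of the reduction shape Raf describes, written with std::async rather than TBB so it stands alone (find_first_index and the grainsize of 1000 are hypothetical choices, not code from this thread): each subrange reports the smallest matching index it found, or -1, and the join keeps the leftmost.

```cpp
#include <future>
#include <vector>

// Hypothetical sketch of the reduction: split the range, search both halves
// in parallel, and join by preferring the left half's match (the leftmost).
template <typename Pred>
int find_first_index(const std::vector<int>& v, int begin, int end, Pred match) {
    if (end - begin <= 1000) {              // grainsize: scan this chunk serially
        for (int i = begin; i < end; ++i)
            if (match(v[i])) return i;
        return -1;                          // identity element: "no match"
    }
    int mid = begin + (end - begin) / 2;
    auto right = std::async(std::launch::async, [&] {
        return find_first_index(v, mid, end, match);
    });
    int left = find_first_index(v, begin, mid, match);
    int r = right.get();
    return left != -1 ? left : r;           // join: a left match always wins
}
```

Note that nothing here short-circuits: both halves are always searched to completion, which is exactly what the rest of this thread tries to improve on.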

ARCH_R_Intel
Employee

Slides 32-34 of http://www.threadingbuildingblocks.org/ver.php?fid=127 show a closely related example.

The expected average number of matches strongly influences what kind of algorithm to use. Do you expect many matches, a few, or less than one in the average case?
RafSchietekat
Black Belt

That code shows how this can be solved as a reduction, but no chunks will be short-circuited (except that chunks to the right of a subrange already known to contain a matching element don't need to be traversed). Wasn't there a thread where somebody suggested parallel_do with cancellation or so (it might have been me), and then somebody else improved on that (perhaps something with parallel_reduce on remapped elements, unless I just dreamt that)?

ARCH_R_Intel
Employee

The tricky part is short-circuiting the chunks after a match, but not any earlier ones.

If the match is expected to come early in the sequence, then the threads should concentrate on the leftmost portion of the sequence first. Here's a stab at doing so. The key idea is to use an atomic counter to pick up the next chunk. The downside is that fetch_and_add has limited scalability, unless you have a vintage NYU Ultracomputer :-) For small core counts it might work out okay.

[cpp]// Do "y=x" if it decreases the value of y.
template<typename T>
void write_min( tbb::atomic<T>& y, T x ) {
    for(;;) {
        T snapshot = y;
        if( snapshot<=x ) break;            // y is already <= x
        if( y.compare_and_swap(x,snapshot)==snapshot ) break;
    }
}

// Find the smallest index j in [0,n) with pred(j), or return n if none.
template<typename F>
int search( int n, F pred ) {
    size_t chunk_size = 10000;
    tbb::atomic<int> i;
    i = 0;
    tbb::atomic<int> result;
    result = n;
    tbb::parallel_for( tbb::blocked_range<int>(0,n,chunk_size),
                       [&](tbb::blocked_range<int> r) {
                           int begin = i.fetch_and_add((int)r.size());
                           int end = begin+(int)r.size();
                           // Quick rejection test: skip chunks entirely to
                           // the right of the best match found so far.
                           if( begin<result )
                               for( int j=begin; j<end; ++j )
                                   if( pred(j) ) {
                                       write_min(result,j);
                                       break;
                                   }
                       },
                       tbb::simple_partitioner() );
    return result;
}[/cpp]
smasherprog
Beginner

Wouldn't it be cheapest to use a standard bool and check it before work is done? Once something is found, any thread can set the bool to true, and the other threads will know to stop. No need for atomics or anything else; if one thread misses the fact that it should stop, it shouldn't be a big deal, and the amount of extra work will depend on where you place the break check.

The exact value of the bool doesn't matter, only that it is non-zero, so this should work no problem.
timelzayus
Beginner

Thank you very much.
I'll go with that kind of solution.
I don't think parallel_reduce is a solution, because it forces a scan of the whole vector, while I want to stop at the first match (in order)...
RafSchietekat
Black Belt

#5 "Wouldn't it be cheapest to use a standard bool and check it before work is done? Once something is found, any thread can set the bool to true, and the other threads will know to stop. No need for atomics or anything else; if one thread misses the fact that it should stop, it shouldn't be a big deal, and the amount of extra work will depend on where you place the break check."
No: see #4, first sentence. Also, without an atomic you have no guarantee that the other threads ever read the new value.

#6 "I don't think parallel_reduce is a solution, because it forces a scan of the whole vector, while I want to stop at the first match (in order)..."
As Arch already indicated, it all depends... The code in #4 looks like a parallel_do in disguise (the solution I would propose as an obvious improvement over a straightforward parallel_reduce). But I thought there was also a way where the elements are interleaved between tasks, so that they all have a roughly equal chance of finding the first match, leading to quicker discovery and better scalability, based on parallel_reduce and a remapping of the elements; unless I'm mixing that up with something else.
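The interleaving idea can be sketched with plain std::thread (a hypothetical illustration; find_first_strided, the stride scheme, and the CAS minimum are assumed names and choices, not code from the thread). Worker t scans indices t, t+P, t+2P, ..., records matches in a shared atomic minimum, and stops once its position passes the current best:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical sketch: P workers scan interleaved subsequences of the
// vector, so they all start near the left end and share a running minimum.
template <typename Pred>
int find_first_strided(const std::vector<int>& v, Pred match, int nthreads = 4) {
    std::atomic<int> best((int)v.size());    // v.size() means "no match yet"
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&v, &best, &match, t, nthreads] {
            for (int i = t; i < (int)v.size(); i += nthreads) {
                if (i >= best.load()) break; // a match at a smaller index exists
                if (match(v[i])) {
                    // CAS loop: best = min(best, i)
                    int cur = best.load();
                    while (i < cur && !best.compare_exchange_weak(cur, i)) {}
                    break;                   // later indices on this stride are larger
                }
            }
        });
    for (auto& w : workers) w.join();
    int r = best.load();
    return r == (int)v.size() ? -1 : r;
}
```

This still returns the leftmost match: the worker owning index k only breaks early if best already holds a genuine match at an index no larger than its current position.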
smasherprog
Beginner

You can use a global non-atomic variable if the following is true: the elementMatch function is small, only a few instructions. In that case a non-atomic boolean flag will give you a speed increase (a small one, but it all adds up). If your elementMatch function is complex, then use an atomic variable. The reason is that atomic variables guarantee serialized access, which in your case might not be needed. Remember, atomic variables are not equivalent to regular non-atomic variables.

RafSchietekat
Black Belt

Writing or reading an atomic variable is not slower than writing or reading a normal variable... unless the compiler optimises access to the normal variable away, but that kind of defeats the purpose. :-)
timelzayus
Beginner

I don't understand the purpose of the tbb::atomic<int> i variable,
nor why you use int begin = i.fetch_and_add(r.size()); int end = begin + r.size();
instead of begin = r.begin(); end = r.end().

Could you explain?
Thank you.
ARCH_R_Intel
Employee

The idea is that the parallel_for hands out the total number of iterations that we want, but the iteration values themselves are issued from variable i. The parallel_for is good for handing out iterations in parallel, but tends to hand out iteration subranges [r.begin(),r.end()) that are scattered all over the iteration space. If matches are frequent enough (and this is problem dependent) all work done to the right of the leftmost match is wasted work. So the intuition is to make threads grab iterations from left to right. The variable i is the counter from which they grab those iterations.

For example, suppose n=1001 and chunk_size=300, and there are two threads. The parallel_for with simple_partitioner will subdivide the iteration space into:

  • [0,250)
  • [250,500)
  • [500,750)
  • [750,1001)
The parallel_for will execute [0,250) and [500,750) first. The other two subranges come next, possibly in either order. We want, however, to search the first 500 elements first. So by using the fetch_and_add, we pick up [0,250) from the first fetch_and_add and [250,500) from the second. The third fetch_and_add will pick up [500,750) or [500,751), depending on which parallel_for subrange comes next.
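The left-to-right handout can be seen with just the counter in isolation (a minimal sketch; next_chunk and next_start are hypothetical names): whatever subrange size the parallel_for hands a worker, the shared counter converts it into the next unclaimed block at the left end of the iteration space.

```cpp
#include <atomic>

// Shared counter: the next unclaimed iteration index, starting at the left.
std::atomic<int> next_start(0);

// Returns the begin index of the next unclaimed block of `size` iterations,
// regardless of which subrange the caller was originally handed.
int next_chunk(int size) { return next_start.fetch_add(size); }
```

With the subrange sizes from the example arriving in the scrambled order 250, 250, 251, 250, the blocks handed out are [0,250), [250,500), [500,751), [751,1001): always left to right.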

The peculiar (and non-deterministic) order that parallel_for hands out iterations is the reason that cancellation tricks (e.g. a boolean flag or using TBB's task cancellation features) are not safe to use. If we find a match for index k, we must be sure that we checked all lower values of k for a match too. There's likely a way to modify the collapsing range hack to do the cancellation correctly. The modification would have to replace my_stop with code to report the range as indivisible if it extends past the lowest k found so far.

As Raf pointed out, there are crafty ways to use parallel_do and input iterators to get a similar left-to-right bias when processing the iteration space. The iterators must not be random-access iterators; otherwise parallel_do acts like parallel_for. Using parallel_do like that depends somewhat on internal implementation details, though. I went with the parallel_for approach because it can be adapted to work with Cilk Plus' cilk_for too.