Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

parallel_find()

Walter_D_
Beginner
1,114 Views

TBB does not provide a parallel_find(), which would act similar to std::find(): find the first element in a range meeting a given condition/predicate. I think those who know the tbb library well should be able to implement this as follows: a range is split into a first and second half with the first half searched at higher priority (put on the stack last). If the first child task succeeds (finds a match), the second task is cancelled. This is done recursively, of course.

Could you consider implementing this? I'm not sure how to implement the cancellation of the second child task if the first succeeds.

0 Kudos
27 Replies
Stephan_Dollberg
Beginner
957 Views

Hi Walter,

I wrote a quick implementation using an atomic<bool> to indicate a hit in any child.

Let me know what you think about it.

You can find it here.

0 Kudos
jimdempseyatthecove
Honored Contributor III
957 Views

Consider writing a recusive routine using parallel_invoke. Pass in a (vector, lower bound, upper bound, fnChriteria). Keep splitting bounds until threshold based on either separation of bounds or depth of recursion. Return condition for not found in either branch or the lower index of the branch with find. QED

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
957 Views

RE: Cancellation

You can change the function from returning the index of the found, to passing in the address of the volatile index. Initialize the found index to not found (e.g. 0xFFFFFFFF). Termination then is indicated by the found index being .lt. your lower bound. Should you find a match, then use a CAS provided that the found entry is .lt. what is now in the found index.

Jim Dempsey

0 Kudos
Walter_D_
Beginner
957 Views

I know how I can indicate termination. My problem is that with TBB I don't know how precisely to peform it. How can the task running the lower-half of the range cancel its sibling task (which runs the upper half)? And, how can the cancellation be guaranteed to propagate down the recursive task tree, should the sibling task already have executed and spawned children?

Another option, which is perhaps what you were thinking about and which does not use cancellation, is to use an atomic variable to store the current best guess (initialised to initial_range::end()) for the iterator of the first found element. If that is less than current_range::begin(), the current_task doesn't need to run. But this requires atomic operations on the iterator -- can we do that? It would restrict the possible template arguments that can be used with parallel_find(). For iterators not allowing atomic operations, we could use a mutex.

0 Kudos
Walter_D_
Beginner
957 Views

Hi Stephan,

I just only now saw your reply. Yes, that worked. However, even on 16 cores, it's only really superior to serial if N>~10^8 and if the searched for element is at index I > ~10^7. For small N or I, it's actually substantially slower than serial code ...

0 Kudos
Stephan_Dollberg
Beginner
957 Views

I think my answer didn't get unlocked immediately.

Yes, but I think that you probably can't get better than that. Searching is not really a computational intensive task and if all you are doing is iterating through cache and doing a single compare, most of the time is probably wasted by threading overhead. When the dataset is large though, I think you can get quite a performance boost in the best case. If you are doing a lot of searches, a concurrent datastructure like a concurrent hash-map might be a better fit.

A plain implementation without workstealing which simply splits itself recursively(e.g.: parallel_invoke as Jim proposed) might even outperform the task-approach here.

0 Kudos
RafSchietekat
Valued Contributor III
957 Views

This looks like a job for a pipeline with a serial input stage that does a serial search through the first few pages and then as its gets bored starts delegating reasonably-sized subranges to later stages (again a few pages each, perhaps growing later on), a parallel stage that finds the earliest occurrence in its own subrange, an ordered serial output stage that looks for the first successful outcome, and a task_group_context for cancellation by the output stage (interpretation by the middle stage might work too). Slightly more advanced would be to interpret default_num_threads() and use interleaving (each data item represents a regular set of staggered subranges, either aligned at cache lines or sufficiently long to just amortise cache misses), thus increasing the chance of finding an answer sooner without using more tasks, and to be sure to interpret cancellation frequently enough in the middle stage. Well, it's what I would try, anyway...

(Added 2013-04-22) Maybe without that interleaving idea, though.

0 Kudos
jimdempseyatthecove
Honored Contributor III
957 Views

>> How can the task running the lower-half of the range cancel its sibling task (which runs the upper half)?

The found index (initialized to 0xFFFFFFFF or 0x7FFFFFFF), which is set with atomic operation (CAS) is used as the termination indicator.


The following is pseudo code
(won't work as-is)

[cpp]
void parallel_find_helper(SomeVector, fnSomeTest, iBegin, iEnd, pFound)
{
  if(*pFound .lt. iBegin) return;
  if(iEnd - iBegin .lt. cutoff)
  {
     for(int i=iBegin; i .lt. iEnd; ++i)
     {
        int OtherFound = *pFound; // freshen
        if(i .lt. OtherFound) return;
        if(fnSomeTest(SomeVector))
        {
          // match
          while(!CAS(pFound, i, &OtherFound))
          {
             if(OtherFound .lt. i) return;
          }
          return; // success, result in *pFound (ours or some other .lt. i)
      }
      return; // fail
  } // if(iEnd - iBegin .lt. cutoff)
  // recurse
  int iSplit = iBegin + ((iEnd-iBegin)/2);
  parallel_invoke(
    []() { parallel_find_helper(SomeVector, fnSomeTest, iBegin, iSplit, pFound); },
    []() { parallel_find_helper(SomeVector, fnSomeTest, iSplit, iEnd, pFound); });
}

int parallel_find(SomeVector, fnSomeTest)
{
  int Found = 0x7FFFFFFF; // not found
  parallel_find_helper(SomeVector, fnSomeTest, SomeVecto.begin(), SomeVector.end(), &Found);
  return Found;
}

[/cpp]

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
957 Views

I disagree: the result is more likely to be found in the first half than in the second half, and real cancellation would entirely prevent execution for most tasks.

0 Kudos
Walter_D_
Beginner
957 Views

Jim,

your code does not cancel any scheduled/running tasks. Rather, it uses an atomic variable as indicator, which each task has to constantly read in order to effectuate something like cancellation. With many threads, this results in a heavy atomic load and hence will not scale well.

Raf,

re your earlier post. Surely, the problem of parallel finding is well researched. The strategy I was advocating is essentially what Anthony Williams uses in his "C++ concurrency in action" (chapter 8.5.2), using std::async(). He also commented on the problem of using an atomic indicator, as the "atomic loads can slow operations".

0 Kudos
RafSchietekat
Valued Contributor III
957 Views

I also had a remark about that atomic before I surreptitiously removed it because I don't see why reading it should necessarily be such a burden... provided of course that you do read it before attempting a CAS (don't rely on any rumours about optimisation in hardware)! If the strategy is otherwise sound (searching primarily near the beginning), there shouldn't be too many updates of that atomic variable before you settle on the final result.

Using reseaurch results is more professional but also less fun than reinventing the wheel on the spot. :-)

The strategy with halving the input range won't work as intended with TBB where stealing is breadth-first. Maybe Anthony Williams is using another toolkit that works differently and therefore more closely resembles my proposal, even if it wastes creating a task for the work currently set aside? But somehow you have to know for certain when the result is in, and with TBB a pipeline seems a natural fit. That's not to say that you couldn't possibly improve the outcome marginally by also communicating through a shared (atomic) currently best result. I'll leave that up to you, if you wish to find out for yourself (but do tell us the outcome!).

Of course, all attempts are futile if memory bandwidth is saturated first, depending on the search predicate.

(BTW, you don't need atomic iterators if they're random access, because then atomic<ptrdiff_t> will help you navigate anywhere in an instant.)

0 Kudos
Stephan_Dollberg
Beginner
957 Views

Anthony Williams uses the same approach with an atomic<boo> flag. He plainly splits recursively and checks after every comparison whether the done flag was set.

I like raf's idea about doing a preliminary scan on the first elements. Nevertheless I think that searching is just not the best task for parallelism.

0 Kudos
jimdempseyatthecove
Honored Contributor III
957 Views

>>

Walter,

>>your code does not cancel any scheduled/running tasks. Rather, it uses an atomic variable as indicator, which each task has to constantly read in order to effectuate something like cancellation. With many threads, this results in a heavy atomic load and hence will not scale well.

I would argue that the cancellation of scheduled/running tasks has higher overhead than reading a shared variable. cancellation requires traversal of list of stacked threads who's status/state are effectively atomic variables. I suspect that explicit thread cancellation has higher overhead than letting thread run to point of observing earlier index find detected. The length of the run of the lowest level task is the (remainder of) run time of the non-forking portion of the search.

With non-(task cancellation), i.e. with atomic found variable, the shared variable gets evicted from L1 cache of other threads only on change, change only occurs when new found has lower index. So let's estimate the number of CAS operations:

Example scenario:

Assumed vector to be searched is partitioned into 1024 parts (tasks)

Absolute worst case could be 1024 CAS operations.
Average is NOT 512

Assuming the 1024 partitions are randomly run with the available thread pool (non-random can be illustrated at a different time).

First update of atomic variable holding found index:
First thread to reach find would occur at average partition 512.
All threads running in partitions above 512 would abort task as soon as found index was observed (immediately after fnMatch).
There is a very small window (2-3 statements) where a CAS might attempt to be performed when unnecessary. This may be negligible.
All subsequent stacked tasks in partitions above 512 would run no further than first test of found index.

Second update of atomic variable holding found index:
First thread to reach find would occur at average partition 256.
... (see first update, adjust for smaller range)

Third... First thread to reach find would occur at average partition 128.
Forth... First thread to reach find would occur at average partition 64
Fifth... First thread to reach find would occur at average partition 32
...
IOW, number of CAS's average to find ~log2(number of partitions)-1
In this case, 9 CAS

A more optimal search strategy would be to run the thread pool in manageable chunks from the low end of the vector onwards. While using similar atomic found index. The parallel_pipeline that Raf suggested could be useful in this regard, however a slightly different forking function may function better. Tests would have to be done to work out the better technique.

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
957 Views

Stephan Dollberg wrote:

Anthony Williams uses the same approach with an atomic<boo> flag. He plainly splits recursively and checks after every comparison whether the done flag was set.

Are these parts safely set aside on a stack with serial access so that a lot of threads won't plunder it and defeat the purpose (thread gets top item that's way too big because other threads have taken the smaller ones already, splits it, puts back the other half in the wrong order because it's bigger than the current top and also succeeds it, next thread gets the wrong top, claims success, and the real winner is still on the stack)? What's the cost of the locking to prevent all these mishaps (don't forget convoying)? Once you add the necessary mutual exclusion, how could this be any better than just taking a fixed-size piece each time? It just doesn't make sense, except perhaps as a poor man's version of TBB to find a hit anywhere in the input range, which is quite different from finding the first hit.

Stephan Dollberg wrote:

I like raf's idea about doing a preliminary scan on the first elements. Nevertheless I think that searching is just not the best task for parallelism.

Don't use it to beat Boyer-Moore and expect to win. But maybe you're doing something more challenging that doesn't incur a memory bottleneck first and where the pickings are slim so that the parallel overhead is a price you want to pay? There is no absolute answer, it all depends. That's why they (also) call it software engineering. :-)

0 Kudos
RafSchietekat
Valued Contributor III
957 Views

Jim Dempsey wrote:

I would argue that the cancellation of scheduled/running tasks has higher overhead than reading a shared variable. cancellation requires traversal of list of stacked threads who's status/state are effectively atomic variables.

I agree (see higher) that an atomic might well be an appropriate communication mechanism, because it will soon settle on the right answer iff the strategy is otherwise sound. But cancellation is actually quite cheap if the current task group context does not have descendants, because then traversal of the scheduler is cut short before it really gets started. In a crowded environment, cancellation should not be overlooked.

I do maintain that no recursive splitting of the input range is a sound strategy to find the first hit.

0 Kudos
Stephan_Dollberg
Beginner
957 Views

@Raf

#1: I don't know if I understand you 100%. He uses recursive binary splitting till a certain threshold is reached, on joining both halfs, he prefers the left element so that the first element is gonna be selected. For every split he spawns a new task for the right half using std::async. So yeah, he has some implicit locking in the std::future. I am not a big fan of his approach either because he also checks after every comparison. This again might of course be ok, if the checks aren't a simple value comparison as you mentioned in your second part. Though, only measurements can prove any of this.

#2: :) You are right of course, my over-generalization was bad. In such a case, my approach from above should work, too, the more difficult thing which probably comes up here is to select the right threshold for more complex and more simple comparisons.

0 Kudos
RafSchietekat
Valued Contributor III
957 Views

Well, that's not very smart of "him", is it? Why would you possibly want to touch the right half before you have exhausted the left half, unless you know there's at most one item in the whole collection, in which case the challenge of finding the first qualifying element and that of finding any qualifying element become one and the same (you can very easily do the latter with parallel_for and cancellation)? Without such knowledge, you also have to keep track of where you have looked already (that's the exhaustive bit mentioned above), implicitly by waiting or explicitly by... well, let's not even go there! In any case, you're wasting performance this way, and with an atomic<bool> it's even worse because it will delay the moment of communication until you've found the final result (it would be just like cancellation) instead of being able to let others know they're on the wrong track (searching to the right of the current best result).

I think you may have misinterpreted this (first vs. any), or else it's just a demonstration of technique (like TBB and Fibonacci), or else you should stay well clear of this person's software and stop reading his books! ;-)

(Added) I've had a look at your program, and if the challenge is big enough for any stealing to occur, a task searching a later part of the collection may set "done" to true, causing a task about to search an earlier part to exit early and return NULL, and then, unless perhaps if there's also a hit further to the left, NULL will be the outcome of the algorithm, instead of the real first hit, or any hit at all really. Instead, you should return "end" to exit early, and then the program will give you any hit in the collection, with a mostly useless bias toward any elements near the start (those are visited before stealing starts) but otherwise just random (in the traditional sense).

0 Kudos
Stephan_Dollberg
Beginner
957 Views

Yeah, I was thinking of "first" more like "give me any as fast as possible and if you have currently more than one hit give me the one most to the left".

0 Kudos
RafSchietekat
Valued Contributor III
957 Views

Most peculiar... what relevance could that possibly have?

Also, I think Walter's original question was rather unambiguous, and it was he who first referred to that book.

I just noticed that I have overlooked Walter's question about cancellation. In TBB, there is no preemption of tasks, neither for scheduling nor for cancellation. Instead, a task must regularly consult the cancellation status of its own task group context, so in that regard it's the same as communicating by atomic<bool>. Where it differs is that those contexts can be nested hierarchically, with probably many tasks being associated with one context, and cancelling one context means cancelling all descendants as well, so effectively all tasks associated with all those contexts. A selling point for using this mechanism is how it is built in to the prepackaged algorithms, so mostly you only have to declare a context, hook it up to an algorithm, and possibly cancel it.

0 Kudos
Stephan_Dollberg
Beginner
902 Views

To me it seems more natural. To make it explicitly clear and to avoid confusion, Anthony Williams also explicitly states that he is looking for "any" element and not the very first one.

Using a task_group_context seems like another good idea for both cases.

0 Kudos
Reply