- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
To compute a reduction (and that's what this is), think parallel_reduce() instead of parallel_for(). It is of course a very special case, which I think has come up before, so you should be able to find the answer in an earlier thread.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The expected average number of matches strongly influences what kind of algorithm to use. Do you expect many matches, a few, or less than one in the average case?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That code explains how this can be solved as a reduction, but no chunks will be short-circuited (except that ones incremental to a subrange containing a corresponding element don't need to be traversed). Wasn't there an answer where somebody suggested parallel_do with cancellation or so (might have been me), and then somebody else improved on that (perhaps something with parallel_reduce on remapped elements, unless I just dreamt that)?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If the match is expected to come early in the sequence, then the threads should concentrate on the leftmost portion of the sequence first. Here's stab at doing so. The key idea is to use an atomic counter to pick up the next chunk. The downside is that fetch_and_add has limited scalability, unless you have a vintage NYU Ultracomputer :-) For for small core counts it might work out okay.
[cpp]// Do "y=x" if it decreases the value of y. templatevoid write_min( tbb::atomic & y, T x ) { for(;;) { T snapshot = y; if( snapshot int search( int n, F pred ) { size_t chunk_size = 10000; tbb::atomic i; i = 0; tbb::atomic result; result = n; tbb::parallel_for( tbb::blocked_range (0,n,chunk_size), [&](tbb::blocked_range r) { int begin = i.fetch_and_add(r.size()); int end = begin+r.size(); // Quick rejection test if( begin
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The exact value of the bool doesnt matter, only that it has a value other than 0 is what matters, so this should work no problem.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
No: see #4, first sentence. Also, you wouldn't have a guarantee that threads read the new value if it's not an atomic.
#6 "I don't think parallel_reduce is a solution because it forces to scan all the vector while I wish to stop at the first match (in order)..."
As Arch already indicates, it all depends... The code in #4 looks like a parallel_do (the solution I would propose as an obvious improvement over a straightforward parallel_reduce) in disguise, but I thought there was a way where elements would be interleaved between tasks so they all had a roughly equal chance to find the first element, leading to a quicker discovery, and with better scalability, based on parallel_reduce and a remapping of the elements, unless I'm mixing that up with something else.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The idea is that the parallel_for hands out the total number of iterations that we want, but the iteration values themselves are issued from variable i. The parallel_for is good for handing out iterations in parallel, but tends to hand out iteration subranges [r.begin(),r.end()) that are scattered all over the iteration space. If matches are frequent enough (and this is problem dependent) all work done to the right of the leftmost match is wasted work. So the intuition is to make threads grab iterations from left to right. The variable i is the counter from which they grab those iterations.
For example, suppose n=1001 and chunk_size=300, and there are two threads.The parallel_for with simple_partitioner will subdivide the iteration space into:
- [0,250)
- [250,500)
- [500,750)
- [750,1001)
The peculiar (and non-deterministic) order that parallel_for hands out iterations is the reason that cancellation tricks (e.g. a boolean flag or using TBB's task cancellation features) are not safe to use. If we find a match for index k, we must be sure that we checked all lower values of k for a match too. There's likely a way to modify the collapsing range hack to do the cancellation correctly. The modification would have to replace my_stop with code to report the range as indivisible if it extends past the lowest k found so far.
As Raf pointed out, there are crafty ways to use parallel_do and input iterators. The iterators must not be random-access iterators, otherwise parallel_do acts like parallel_for, to get similar left-to-right bias when processing the iteration space. Though using parallel_do like that is depending somewhat on internal implementation details. I went with the parallel_for approach because it can be adapted to work with Cilk Plus' cilk_for too.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page