Hello,
Here is my code:
class TBBFunctor
{
public:
    TBBFunctor() {}
    void operator()( tbb::blocked_range< std::set< int >::const_iterator > & range )
    {}
};

int main( ... )
{
    std::set< int > vec;
    tbb::parallel_for( tbb::blocked_range< std::set< int >::const_iterator >( vec.begin(), vec.end() );
}
VS2008 reports this error:
tbb\blocked_range.h(90) : error C2678: binary '<' : no operator found which takes a left-hand operand of type 'const std::_Tree<_Traits>::const_iterator' (or there is no acceptable conversion)
I get the same error with std::list.
I don't understand why. Could you help me?
Thanks...
1 Solution
It occurred to me that you might be best off using parallel_do instead of parallel_for, because parallel_do is designed to work with forward iterators.
However, since I said I'd post a range that worked, here is the example with it. I kept the implementation simple at some expense of usability. The user is expected to pass in the correct size of the range (see assertion). A more elegant implementation would have a constructor that took the container as an argument.
Template class blocked_range is an odd beast, because it is designed for two kinds of arguments: random access iterators or integers. Because of the integer case, it uses operations like "+" and "-" instead of the STL "advance" and "distance".
[cpp]
#include "tbb/tbb.h"
#include <set>
#include <cstdio>
#include <cassert>
#include <iterator>

template<typename I>
class forward_range {
I my_begin;
I my_end;
size_t my_size;
public:
I begin() const {return my_begin;}
I end() const {return my_end;}
bool empty() const {return my_begin==my_end;}
bool is_divisible() const {return my_size>1;}
forward_range( I first, I last, size_t size ) : my_begin(first), my_end(last), my_size(size) {
assert( size==size_t(distance( first,last )));
}
forward_range( forward_range& r, tbb::split ) {
size_t h = r.my_size/2;
my_end = r.my_end;
my_begin = r.my_begin;
advance( my_begin, h ); // Might be scaling issue
my_size = r.my_size-h;
r.my_end = my_begin;
r.my_size = h;
}
};
class TBBFunctor {
public :
TBBFunctor() {}
void operator()( forward_range< std::set< int >::const_iterator > & range ) const
{
for( std::set< int >::const_iterator i=range.begin(); i!=range.end(); ++i )
std::printf("%d\n",*i);
}
};
int main( )
{
std::set< int > vec;
for( int i=0; i<20; ++i )
vec.insert(i);
tbb::parallel_for( forward_range< std::set< int >::const_iterator >( vec.begin(), vec.end(), vec.size() ), TBBFunctor() );
}
[/cpp]
Notice the line "Might be scaling issue". This is another reason for blocked_range not to support forward iterators: in general the time to traverse the range from begin to end is inherently linear (not parallelizable) and thus presents a scalability issue.
I suggest trying parallel_do first, since it is specifically designed for forward iterators.
12 Replies
The posting editor may have mangled your code - there is an argument missing for parallel_for.
Try using the "Syntaxhighlighter" when pasting code. It's the yellow marker icon on the menu, just before the omega. When using the Syntaxhighlighter, be sure to set the language to "C++". I often forget to do this and the default "bash" does bad things to C++.
I patched the example up enough to compile it. The problem is that blocked_range expects a random-access iterator (necessary for blocking), but set::iterator is only a bidirectional iterator.
It's easy enough to create a range that does what you want. I'll post one shortly.
Thanks a lot, Arch.
I think I'll follow your first suggestion and use parallel_do.
"Notice the line "Might be scaling issue". This is another reason for blocked_range not to support forward iterators: in general the time to traverse the range from begin to end is inherently linear (not parallelizable) and thus presents a scalability issue."
Have you observed that forward_range has scaling problems compared to parallel_do, or can you explain it more clearly? forward_range would do some extra walking, but much of the walking is done in parallel and read-only, and then the benefits of recursive parallelism would kick in. There's no redeeming feeder for parallel_do's non-recursiveness. What did I miss?
The problem with divide and conquer on forward-iterated ranges is that it involves a quadratic number of iterations with respect to the number of splits. Thus the amount of processing work per chunk necessary to offset the cost of the splitting procedure grows quadratically too. Taking into account that the cost of memory access is pretty high (even a 2nd-level cache hit costs around 8 clock ticks), not every algorithm can provide such an amount of computational activity.
E.g. if we have a linked data structure of 10000 elements, and it is recursively partitioned into 1000 chunks (10 items per chunk), the cost of partitioning will be 1000 * 1000 / 2 operations multiplied by the effective memory access latency (1 to 3 clock ticks in case of an L1 cache hit). This gives us approx 1 000 000 clock ticks per 10 items, or 100 000 clock ticks per item, to make dual-core performance equal to the serial version (any other bookkeeping costs aside). And the scalability will be limited by a factor of 2 (because of Amdahl's law).
Another drawback of linked data structures is poor hardware prefetching performance, and cache pollution by the bookkeeping data of the data structure's nodes.
How do you get O(N^3) iterations for O(N^2) work per chunk with O(N) chunks? I get only O(log N) work per chunk, which seems reasonable. Then, there's no need to allocate all tasks in one thread (unless I'm mistaken about parallel_do) and then remotely deallocate them, instead of doing those things mostly locally with parallel_for. I do get the argument about prefetching performance, but it gets a bit blurry with the cache pollution. Should I try again with the TV off?
Well, it's definitely not O(N^3) :-)
In my example calculations I used an N^2/2 metric for the number of memory accesses necessary to iterate through the linked data structure. But I indeed seem to have got it wrong, as it should be (N log N) / 2 only. I've also missed the fact that I had to multiply the number of iterations in my example by the number of nodes in the chunk (as a forward iterator does not allow you to jump from chunk to chunk).
So let's consider two variants of the same example: 10000 nodes, with chunk sizes 1 and 10. The cost of splitting (only node accesses are taken into account) will be 10000 * log2(10000) * 1 / 2 ≈ 66 500 in the first case, and 1000 * log2(1000) * 10 / 2 ≈ 50 000 in the second.
Factoring in the other operations involved, and taking into account the cost of memory access, we'll still get a splitting overhead well above 100 000 clock ticks. It's not as bad as the figure in my yesterday's post :-), but it still cannot be easily discounted.
When I mentioned cache pollution I meant that in the case of an array, the cache would contain only the data being processed, while in the case of a linked data structure the cache will hold the members of the data structure's nodes (ranging from one or two pointers for linked lists to half a dozen and more for various trees). When working with, say, doubles, this dead weight may consume a significant part of the cache capacity.
"So let's consider two variants of the same example: 10000 nodes, with chunk sizes 1 and 10. The cost of splitting (only node accesses are taken into account) will be 10000 * log2(10000) * 1 / 2 ≈ 66 500 in the first case, and 1000 * log2(1000) * 10 / 2 ≈ 50 000 in the second."
At 6.65 or 5 accesses per node, that doesn't look prohibitive?
If this is used at the highest level, I might consider doling out a few chunks right at the start to counter start-up latency, but otherwise I see no reason to discount divide and conquer without actually trying.
I agree; generally lower cache efficiency aside, the overhead does not look too bad, so divide and conquer may turn out to be quite scalable.
To clarify a technical point: divide-and-conquer on forward-iterated ranges takes only O(N lg N) total work, not quadratic work. The trick is to walk the range once to get its length, and then remember the length throughout the rest of the splitting process.
Scalability will nonetheless be poor because even with an infinite number of processors, the critical path is O(N), and, as Andrey points out, cache behavior will be horrible. In fact it is probably faster to walk the list once to store the elements into an array, and then work on the array.
"The trick is to walk the ranged once to get its length, and then remember the length throughout the rest of the splitting process."
In this case the length was a constructor parameter, of course.
"In fact it is probably faster to walk the list once to store the elements into an array, and then work on the array."
parallel_for() with an array would obviously be significantly faster than with forward_range, but I'm not yet convinced that it's an unavoidable intermediate representation beyond an array sized a small multiple of the number of available workers, especially at the outermost level where the cache might be overwhelmed (what would be the serious technical word for that again?), or perhaps recursively.
Anyhow, it's just an interesting thought experiment until somebody (else) thinks it would be even more fun to actually try this. :-) And it shows how recursively divisible ranges, even if the division is only approximate and therefore not entirely equivalent to random access, would be a welcome addition to STL's current repertoire of iterators!