Hi,
Comparing std::for_each vs. tbb::parallel_for_each on a std::vector containing 100,000,000 ints, I got some strange results. I'm adding the constant 5 to each element of the vector with this code:
[cpp]#include <vector>
#include <algorithm>
#include <iostream>
#include <windows.h> // timeGetTime()
#include "tbb.h"

int rnd () { return (rand() % 100); }

class Adder
{
public:
    Adder(int v) : value(v) {}
    void operator () (int &v) const { v += value; }
    int value;
};

int main()
{
    std::vector<int> numbers (100000000);
    std::generate(numbers.begin(), numbers.end(), rnd);
    DWORD time = timeGetTime();
    std::for_each(numbers.begin(), numbers.end(), Adder(5));
    std::cerr << "time spent (std::for_each): " << timeGetTime()-time << std::endl;
    time = timeGetTime();
    tbb::parallel_for_each(numbers.begin(), numbers.end(), Adder(5));
    std::cerr << "time spent (tbb::parallel_for_each): " << timeGetTime()-time << std::endl;
    return 0;
}[/cpp]
Measuring the time for each version, I got the following result:
time spent (std::for_each): 121
time spent (tbb::parallel_for_each): 6722
Am I expecting something wrong, or shouldn't the parallel_for_each version be faster?
Regards,
Jerome
System: Core i3, Windows SDK 7.0, tbb40_20120613oss
14 Replies
No, the behaviour seems to be right. In practice, parallel_for_each creates a task for each single element of your vector, and the amount of work performed per element is far too low to justify the cost of splitting the range and of building and running the tasks, hence the long execution time.
If you use tbb::parallel_for( ... ), with a reasonably big grainsize (100000?), you should be able to see a significant speed-up compared to the STL algorithm.
Hope it helps
Thanks for the fast answer. I tried tbb::parallel_for and got a significant speed-up, but the STL algorithm still beats the TBB version. I will take a closer look at this problem later tonight.
But perhaps you can also help with another question. Measuring tbb::parallel_sort on the same vector, I got two very different times.
[cpp]#include <vector>
#include <algorithm>
#include <iostream>
#include <windows.h> // timeGetTime()
#include "tbb.h"

using namespace std;

int RandomNumber () { return (rand() % 100); }

int main()
{
    std::vector<int> numbers (100000000), numbers2, numbers3;
    std::generate(numbers.begin(), numbers.end(), RandomNumber);
    numbers2.insert(numbers2.end(), numbers.begin(), numbers.end());
    numbers3.insert(numbers3.end(), numbers.begin(), numbers.end());
    // Version 1: STL (std::sort)
    DWORD time = timeGetTime();
    std::sort(numbers.begin(), numbers.end());
    std::cerr << "time spent (std::sort): " << timeGetTime()-time << std::endl;
    // Version 2: TBB (tbb::parallel_sort) using iterators
    time = timeGetTime();
    tbb::parallel_sort(numbers2.begin(), numbers2.end());
    std::cerr << "time spent (tbb::parallel_sort using iterators): " << timeGetTime()-time << std::endl;
    // Version 3: TBB (tbb::parallel_sort) using the address of the internal array
    time = timeGetTime();
    tbb::parallel_sort(&numbers3[0], &numbers3[0]+numbers3.size());
    std::cerr << "time spent (tbb::parallel_sort using pointers): " << timeGetTime()-time << std::endl;
    return 0;
}
[/cpp]
The first version (pure std::sort) took 4944 ms. The second version (tbb::parallel_sort using iterators) took 10019 ms. And the third version (tbb::parallel_sort using the address of the internal array) took 2452 ms, which is what I expected. But what is the difference between versions 2 and 3? In the end, both use (parallel_sort.h):
[cpp]template<typename RandomAccessIterator, typename Compare>
void parallel_sort( RandomAccessIterator begin, RandomAccessIterator end, const Compare& comp ) {
    const int min_parallel_size = 500;
    if( end > begin ) {
        if (end - begin < min_parallel_size) {
            std::sort(begin, end, comp);
        } else {
            internal::parallel_quick_sort(begin, end, comp);
        }
    }
}[/cpp]
Thanks in advance
Hello,
std::vector is serial, you need to use tbb's concurrent_vector to get any speed-up.
this should help:)
--Vladimir
Without knowing what your internal::parallel_quick_sort does, it's hard to say what's going wrong. However, from my personal experience, using pointer arithmetic leads to better results than the subscript operator of std::vector (it's not hard to imagine why).
I use std::vector a lot (everybody should!) and I usually pass the underlying array to my functions via the std::vector::data() member. A better solution could be to use iterators rather than the subscript operator.
Have a look at this example that I've coded a couple of days ago [1], it does something really similar to your first example.
[1] https://github.com/davideanastasia/intel-tbb-experiments/tree/master/parallel_pow
Hi,
using tbb's concurrent_vector is obviously the correct way. My fault. But I still have the same problem with tbb::parallel_sort: using iterators is slower than passing pointers to the array to parallel_sort (6427 ms vs. 2215 ms).
[cpp]int RandomNumber () { return (rand() % 100); }

int main()
{
    tbb::concurrent_vector<int> numbers2 (100000000), numbers3;
    std::generate(numbers2.begin(), numbers2.end(), RandomNumber);
    numbers3 = numbers2;
    // Version 2: TBB (tbb::parallel_sort) using iterators
    DWORD time = timeGetTime();
    tbb::parallel_sort(numbers2.begin(), numbers2.end());
    std::cerr << "time spent (tbb::parallel_sort using iterators): " << timeGetTime()-time << std::endl;
    // Version 3: TBB (tbb::parallel_sort) using the address of the internal array
    time = timeGetTime();
    tbb::parallel_sort(&numbers3[0], &numbers3[0]+numbers3.size());
    std::cerr << "time spent (tbb::parallel_sort using pointers): " << timeGetTime()-time << std::endl;
    return 0;
}
[/cpp]
Any hints?
Jerome
#1 "If you use tbb::parallel_for( ... ), with a reasonably big grainsize (100000?), you should be able to see a significant speed-up compared to the STL algorithm."
Is that an empirical result? The number usually tossed around is about 10000 instructions, with grainsize a fraction of that.
#2 "The first version (pure stl::sort) took 4944 ms. The second version (tbb::parallel_sort using iterators) took 10019 ms. And finally the third version (tbb::parallel_sort using adress of the internal array) took 2452 ms. That's what I expected. But what is the difference between version 2 and 3?"
Perhaps it's an implementation issue (have a look at the code if you can), like a sanity check on the iterator (just speculating, as I only really know of a range check on operator[] in another implementation)? You would then probably have the same issue with parallel_for. Note that you may get another speed-up with parallel_for() by hoisting Range::end() out of the loop in the Body (make a copy once and compare against that), to make it easier, or possible at all, for the compiler to apply some very relevant inner-loop optimisations. But you didn't specify whether these timings were with or without optimisation?
#3 "std::vector is serial, you need to use tbb's concurrent_vector to get any speed-up."
I recently got into trouble myself for answering too hastily...
Thanks for your answer. Using std::vector::data() is better than the subscript operator.
But for now I'm really confused. Reading the TBB examples and references and your answers, I have three options for using a vector:
1. std::vector<int>
2. std::vector<int, tbb::scalable_allocator<int> >
3. tbb::concurrent_vector<int>
So, which one is the correct option?
Jerome
#4 "A better solution could be to use iterators rather than the subscript operator."
I was assuming that iterators were already being used, as in the sort example provided?
#5 "using tbb's concurrent_vector is obviously the correct way."
Not when competing with an efficient std::vector, but perhaps compared to the one you have it may still turn out to be a winner. Note that your current overhead may depend on whether you have a debug or release build, with some assertions perhaps disabled in the release build.
#7 "So, which one is the correct option."
If you need to grow a vector concurrently, use concurrent_vector, otherwise an ordinary std::vector should be enough. You might use the scalable allocator on specific objects like std::vector, although that wouldn't matter much if you used reserve() properly, and you'll likely have a more important gain by replacing new/delete altogether.
#8 I'm using the Windows SDK 7.0 (VC 9) and these compiler options
"-nologo -Zm200 -Zc:wchar_t- -O2 -MD -GR -EHsc -W3 -w34100 -w34189 -DUNICODE -DWIN32" for a release build
Strange that these results are for a release build, as that seems to rule out the simplest explanation. It would still be instructive to compare with another platform... and by doing that myself on the example in #5 (using tbb::tick_count) on Intel Core Duo/Ubuntu/g++ with a recent TBB I got the following interesting and worrying outcome: 24.3404 vs. 6.83917.
(Added) Aha, with an ordinary std::vector it's better: 6.41283 vs. 6.55922. I hadn't noticed the concurrent_vector at first, so this would be evidence that a std::vector is indeed more efficient... and you have to be very careful to know whether the concurrent_vector's allocation is contiguous at any point in time!
@Raf Schietekat
Thanks for your efforts! I finally found the problem: with a different compiler (Windows SDK 7.1 (VS 2010)) everything was all right, except for the difference between tbb::concurrent_vector and std::vector that you also mentioned.
So I think the combination of VS 2008 and iterators of STL containers is a huge problem.
Jerome
#1 "If you use tbb::parallel_for( ... ), with a reasonably big grainsize (100000?), you should be able to see a significant speed-up compared to the STL algorithm."
Is that an empirical result? The number usually tossed around is about 10000 instructions, with grainsize a fraction of that.
The manual says that the best performance is achieved with a grainsize between 10k and 100k. From my personal experience, whether the best number is closer to one end or the other depends on how complex the "kernel" is (the computation per sample). In this case, a simple addition is an easy job for any modern processor, so I would definitely opt for the bigger end of this interval. However, I usually use the auto_partitioner anyway.
Based on [1,2], it seems that VS9 STL implementation is not the best performing around.
[1] http://askldjd.wordpress.com/2010/07/24/stl-performance-comparison-round-2-vc9-vc10-stlport/
[2] http://askldjd.wordpress.com/2009/09/13/stl-performance-comparison-vc71-vc90-and-stlport/
