Hi,
Comparing std::for_each vs. tbb::parallel_for_each on a std::vector containing 100,000,000 ints, I got some strange results. I'm adding the constant 5 to each element of the vector with this code:
[cpp]#include <vector>
#include <algorithm>
#include <iostream>
#include <windows.h> // timeGetTime()
#include "tbb.h"

int rnd () { return (rand() % 100); }

class Adder
{
public:
    Adder(int v) : value(v) {}
    void operator () (int &v) const { v += value; }
    int value;
};

int main()
{
    std::vector<int> numbers (100000000);
    std::generate(numbers.begin(), numbers.end(), rnd);
    DWORD time = timeGetTime();
    std::for_each(numbers.begin(), numbers.end(), Adder(5));
    std::cerr << "time spent (std::for_each): " << timeGetTime()-time << std::endl;
    time = timeGetTime();
    tbb::parallel_for_each(numbers.begin(), numbers.end(), Adder(5));
    std::cerr << "time spent (tbb::parallel_for_each): " << timeGetTime()-time << std::endl;
    return 0;
}[/cpp]
Measuring the time for each version, I got the following result:
time spent (std::for_each): 121
time spent (tbb::parallel_for_each): 6722
Am I expecting something wrong, or shouldn't the parallel_for_each version be faster?
Regards,
Jerome
System: Core i3, Windows SDK 7.0, tbb40_20120613oss
14 Replies
No, the behaviour seems to be right. In practice, parallel_for_each creates a task for each single element of your vector, and the amount of work performed per element is far too low to justify the cost of splitting the range and of building and running the tasks, hence the long execution time.
If you use tbb::parallel_for( ... ), with a reasonably big grainsize (100000?), you should be able to see a significant speed-up compared to the STL algorithm.
Hope it helps
Thanks for the fast answer. I tried tbb::parallel_for and got a significant speed-up, but the STL algorithm still beats the TBB version. I will take a closer look at this problem later tonight.
But perhaps you can also help with another question. Measuring tbb::parallel_sort on the same vector, I got two very different times.
[cpp]#include <vector>
#include <algorithm>
#include <iostream>
#include <windows.h> // timeGetTime()
#include "tbb.h"

using namespace std;

int RandomNumber () { return (rand() % 100); }

int main()
{
    std::vector<int> numbers (100000000), numbers2, numbers3;
    std::generate(numbers.begin(), numbers.end(), RandomNumber);
    numbers2.insert(numbers2.end(), numbers.begin(), numbers.end());
    numbers3.insert(numbers3.end(), numbers.begin(), numbers.end());
    // Version 1: STL (std::sort)
    DWORD time = timeGetTime();
    std::sort(numbers.begin(), numbers.end());
    std::cerr << "time spent (std::sort): " << timeGetTime()-time << std::endl;
    // Version 2: TBB (tbb::parallel_sort) using iterators
    time = timeGetTime();
    tbb::parallel_sort(numbers2.begin(), numbers2.end());
    std::cerr << "time spent (tbb::parallel_sort using iterators): " << timeGetTime()-time << std::endl;
    // Version 3: TBB (tbb::parallel_sort) using the address of the internal array
    time = timeGetTime();
    tbb::parallel_sort(&numbers3[0], &numbers3[0]+numbers3.size());
    std::cerr << "time spent (tbb::parallel_sort using pointers): " << timeGetTime()-time << std::endl;
    return 0;
}
[/cpp]
The first version (pure std::sort) took 4944 ms. The second version (tbb::parallel_sort using iterators) took 10019 ms. And the third version (tbb::parallel_sort using the address of the internal array) took 2452 ms, which is what I expected. But what is the difference between versions 2 and 3? In the end, both use (parallel_sort.h):
[cpp]template<typename RandomAccessIterator, typename Compare>
void parallel_sort( RandomAccessIterator begin, RandomAccessIterator end, const Compare& comp ) {
    const int min_parallel_size = 500;
    if( end > begin ) {
        if (end - begin < min_parallel_size) {
            std::sort(begin, end, comp);
        } else {
            internal::parallel_quick_sort(begin, end, comp);
        }
    }
}[/cpp]
Thanks in advance
Hello,
std::vector is serial, you need to use tbb's concurrent_vector to get any speed-up.
this should help:)
--Vladimir
Without knowing what your internal::parallel_quick_sort does, it's hard to say what's going wrong. However, from my personal experience, using pointer arithmetic leads to better results than the subscript operator of std::vector (it's not hard to imagine why).
I use std::vector a lot (everybody should!) and I usually pass the underlying array to my functions via the std::vector::data() member. A better solution could be to use iterators rather than the subscript operator.
Have a look at this example that I've coded a couple of days ago [1], it does something really similar to your first example.
[1] https://github.com/davideanastasia/intel-tbb-experiments/tree/master/parallel_pow
Hi,
using tbb's concurrent_vector is obviously the correct way. My fault. But I still have the same problem with tbb::parallel_sort: using iterators is slower than passing pointers to the array to parallel_sort (6427 ms vs. 2215 ms).
[cpp]int RandomNumber () { return (rand() % 100); }

int main()
{
    tbb::concurrent_vector<int> numbers2 (100000000), numbers3;
    std::generate(numbers2.begin(), numbers2.end(), RandomNumber);
    numbers3 = numbers2;
    // Version 2: TBB (tbb::parallel_sort) using iterators
    DWORD time = timeGetTime();
    tbb::parallel_sort(numbers2.begin(), numbers2.end());
    std::cerr << "time spent (tbb::parallel_sort using iterators): " << timeGetTime()-time << std::endl;
    // Version 3: TBB (tbb::parallel_sort) using the address of the internal array
    time = timeGetTime();
    tbb::parallel_sort(&numbers3[0], &numbers3[0]+numbers3.size());
    std::cerr << "time spent (tbb::parallel_sort using pointers): " << timeGetTime()-time << std::endl;
    return 0;
}
[/cpp]
Any hints?
Jerome
#1 "If you use tbb::parallel_for( ... ), with a reasonably big grainsize (100000?), you should be able to see a significant speed-up compared to the STL algorithm."
Is that an empirical result? The number usually tossed around is about 10000 instructions, with grainsize a fraction of that.
#2 "The first version (pure stl::sort) took 4944 ms. The second version (tbb::parallel_sort using iterators) took 10019 ms. And finally the third version (tbb::parallel_sort using adress of the internal array) took 2452 ms. That's what I expected. But what is the difference between version 2 and 3?"
Perhaps it's an implementation issue (have a look at the code if you can), like a sanity check on the iterator (just speculating, as I only really know of a range check on operator[] in another implementation)? You would then probably have the same issue with parallel_for. Note that you may get another speed-up with parallel_for() by hoisting Range::end() out of the loop in the Body (make a copy once and compare against that), to make it easier, or possible at all, for the compiler to apply some very relevant inner-loop optimisations. But you didn't specify whether these timings were with or without optimisation?
#3 "std::vector is serial, you need to use tbb's concurrent_vector to get any speed-up."
I recently got into trouble myself for answering too hastily...
Thanks for your answer. Using std::vector::data() is better than the subscript operator.
But for now I'm really confused. Reading the TBB examples and references and your answers, I have three options for using a vector:
1. std::vector<int>
2. std::vector<int, tbb::scalable_allocator<int> >
3. tbb::concurrent_vector<int>
So, which one is the correct option?
Jerome
#4 "A better solution could be to use iterators rather than the subscript operator."
I was assuming that iterators were already being used, as in the sort example provided?
#5 "using tbb's concurrent_vector is obviously the correct way."
Not when competing with an efficient std::vector, but perhaps compared to the one you have it may still turn out to be a winner. Note that your current overhead may depend on whether you have a debug or release build, with some assertions perhaps disabled in the release build.
#7 "So, which one is the correct option."
If you need to grow a vector concurrently, use concurrent_vector, otherwise an ordinary std::vector should be enough. You might use the scalable allocator on specific objects like std::vector, although that wouldn't matter much if you used reserve() properly, and you'll likely have a more important gain by replacing new/delete altogether.
#8 I'm using the Windows SDK 7.0 (VC 9) and these compiler options
"-nologo -Zm200 -Zc:wchar_t- -O2 -MD -GR -EHsc -W3 -w34100 -w34189 -DUNICODE -DWIN32" for a release build
Strange that these results are for a release build, as that seems to rule out the simplest explanation. It would still be instructive to compare with another platform... and by doing that myself on the example in #5 (using tbb::tick_count) on Intel Core Duo/Ubuntu/g++ with a recent TBB I got the following interesting and worrying outcome: 24.3404 vs. 6.83917.
(Added) Aha, with an ordinary std::vector it's better: 6.41283 vs. 6.55922. I hadn't noticed the concurrent_vector at first, so this would be evidence that a std::vector is indeed more efficient... and you have to be very careful to know whether the concurrent_vector's allocation is contiguous at any point in time!
@Raf Schietekat
Thanks for your efforts! I finally found the problem: with a different compiler (Windows SDK 7.1 (VS 2010)) everything was all right, except for the difference between tbb::concurrent_vector and std::vector that you also mentioned.
So I think the combination of VS 2008 and iterators of STL containers is a huge problem.
Jerome
#1 "If you use tbb::parallel_for( ... ), with a reasonably big grainsize (100000?), you should be able to see a significant speed-up compared to the STL algorithm."
Is that an empirical result? The number usually tossed around is about 10000 instructions, with grainsize a fraction of that.
The manual says that the best performance is achieved with a grainsize between 10k and 100k. From my personal experience, whether the best number is closer to one end or the other depends on how complex the "kernel" is (the computation per sample). In this case, a simple addition is an easy job for any modern processor, so I would definitely opt for the bigger end of this interval. However, I usually use the auto_partitioner anyway.
Based on [1,2], it seems that VS9 STL implementation is not the best performing around.
[1] http://askldjd.wordpress.com/2010/07/24/stl-performance-comparison-round-2-vc9-vc10-stlport/
[2] http://askldjd.wordpress.com/2009/09/13/stl-performance-comparison-vc71-vc90-and-stlport/
