Hi,
I am playing around with TBB's parallel_for and parallel_for_each in combination with an STL vector. I used the following test code (also in the attachment), with the corresponding questions below:
[cpp]
#include <cstdio>
#include <cstdlib>
#include <algorithm>
#include <vector>
#include "tbb/parallel_for_each.h"
#include "tbb/tick_count.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

int rnd() {
    return (rand() % 100);
}

int main()
{
    int addition = 5;

    tbb::tick_count before = tbb::tick_count::now();
    std::vector<int> numbers(100000000);
    std::generate(numbers.begin(), numbers.end(), rnd);
    tbb::tick_count after = tbb::tick_count::now();
    std::printf("time spent (generate random): \t%g seconds\n", (after - before).seconds());

    // std::for_each version
    before = tbb::tick_count::now();
    std::for_each(numbers.begin(), numbers.end(), [&] (int &v) { v += addition; });
    after = tbb::tick_count::now();
    std::printf("time spent (std::for_each): \t%g seconds\n", (after - before).seconds());

    // tbb::parallel_for_each version
    before = tbb::tick_count::now();
    tbb::parallel_for_each(numbers.begin(), numbers.end(), [&] (int &v) { v += addition; });
    after = tbb::tick_count::now();
    std::printf("time spent (tbb::parallel_for_each): \t%g seconds\n", (after - before).seconds());

    // tbb::parallel_for iterator version
    before = tbb::tick_count::now();
    tbb::parallel_for(
        tbb::blocked_range<std::vector<int>::iterator>(numbers.begin(), numbers.end()),
        [&] (tbb::blocked_range<std::vector<int>::iterator> number) {
            for (std::vector<int>::iterator it = number.begin(); it != number.end(); it++) {
                (*it) += addition;
            }
        });
    after = tbb::tick_count::now();
    std::printf("time spent (tbb::parallel_for iterator): \t%g seconds\n", (after - before).seconds());

    // tbb::parallel_for index version
    before = tbb::tick_count::now();
    tbb::parallel_for(size_t(0), size_t(numbers.size()),
        [&] (size_t index) { numbers[index] += addition; });
    after = tbb::tick_count::now();
    std::printf("time spent (tbb::parallel_for index): \t%g seconds\n", (after - before).seconds());

    return 0;
}
[/cpp]
std::for_each version:
The execution time of this version is 0.488757 seconds.
tbb::parallel_for_each version:
The measurement of adding 5 to every element using parallel_for_each is by far the slowest, at 8.59165 seconds. Does parallel_for_each execute each iteration in its own thread, producing the overhead that slows this version down? Does this mean parallel_for_each should only be used if each iteration needs a large number of CPU cycles?
tbb::parallel_for iterator version:
I am not sure whether this implementation using iterators is the simplest one. Is there a shorter/simpler way to use parallel_for with iterators of an STL vector? The execution time for this version is 1.12382 seconds. This is slower than std::for_each, which leads me to the question: under which conditions is parallel_for with iterators a practical solution?
parallel_for index version:
The execution time for this version is 0.452654 seconds, which beats the serial std::for_each version and is what I expected. Is this the proposed way to use parallel_for with an STL vector?
Further, my assumption is that using a std::vector in this test code is safe, even with parallel writes at different indexes, as the elements are independent and the vector is not expected to grow. Is this assumption correct?
Thanks for feedback,
martin
The measurements were done on Win7 using Microsoft Visual C++ Compiler Version 16.00.30319.01 and TBB 4.1, on an Intel i7-2620M CPU @ 2.70 GHz.
#0 "tbb::parallel_for iterator version: I am not sure whether this implementation using iterators is the simplest one. Is there a shorter / simpler implementation of using parallel_for with iterator of an STL vector? The execution time for this version is: 1.12382 seconds. This is slower than using std::for_each and gets me to the question, under which condition parallel_for using iterator is a practical solution?" You should probably "hoist" the evaluation of end() out of the loop, by defining "it_end = number.end()" next to "it" (unfortunately you cannot make it_end const without taking that variable out of the loop header), which may allow considerable optimisation from the compiler. Is the documentation not clear about that (I didn't check that myself at this time)? You can probably also pass the blocked_range as a const reference, but I'm not sure what the effect will be. Please report your findings.
#0 "parallel_for index version: The execution time for this version is 0.452654 seconds which tops the serial std::for_each version and is what I expected. Is this the proposed way to use parallel_for with an STL vector?" That seems abysmally bad, unless you're running this on a single thread? You should get very good scalability for this simple kernel.
#0 "Further, my assumption is that using an std::vector in this test code is safe, also the parallel writes at different indexes, as the elements are independent and the vector is not expected to grow. Is this assumption correct?" I think the C++ specification is deficient by neglecting to explicitly provide this guarantee (unless I overlooked it), but normally it's a safe assumption.
#1 "*** do not use rand inside test load code *** Functions such as this have a critical section. Meaning your parallel test runs a serial section (plus overhead to serialize the critical section). Consider changing your test load function to something that does not use a critical section." This is probably harmless in the test setup.
#1 "Generally speaking: Use for( or for_each when the amount of computation is small (not worth the overhead to parallelize). Use parallel_for when you have a large number of objects and each object has an equal amount of small work. Use parallel_for with an appropriate grainsize when you have a large number of objects and each object has an unequal amount of small work. Use parallel_for_each when each object's work is relatively large, the number of objects is few, and each object has unequal work. When the number of objects is very small and known, consider using switch(nObjects) with cases using parallel_invoke for each case you wish to implement, and parallel_for_each for the default case." Hmm, using a parallel construct instead of a serial construct allows cancellation, grainsize is only indirectly related to equal/unequal work (extremely unequal work precludes auto_partitioner, and simple_partitioner works best with a properly configured grainsize), parallel_for_each is never quicker than parallel_for (assuming random iterators) but may be more convenient than parallel_for with an explicit simple_partitioner (assuming the stated conditions), and I'm curious what speedup you could possibly get from switch and parallel_invoke().
"Perhaps you could make a comparison with parallel_for() explicitly configured with simple_partitioner and grainsize 1 to see whether this fully explains the difference"
I did this test as proposed and I get an execution time of 10.2749 seconds, which is approximately 2 seconds slower than using parallel_for_each. So having grainsize 1 and using the simple_partitioner disables the automatic chunking and spawns a thread for every iteration. This is slower than parallel_for_each, which lets me assume that parallel_for_each works a little differently, as you explained in the answer.
"You should probably "hoist" the evaluation of end() out of the loop, by defining "it_end = number.end()" next to "it" (unfortunately you cannot make it_end const without taking that variable out of the loop header), which may allow considerable optimisation from the compiler"
I did the proposed optimizations:
[cpp]
tbb::parallel_for(
    tbb::blocked_range<std::vector<int>::iterator>(numbers.begin(), numbers.end()),
    [&] (const tbb::blocked_range<std::vector<int>::iterator> &number) {
        std::vector<int>::iterator it_end = number.end();
        for (std::vector<int>::iterator it = number.begin(); it != it_end; it++) {
            (*it) += addition;
        }
    });
[/cpp]
"Is the documentation not clear about that (I didn't check that myself at this time)?"
I couldn't find anything in the documentation that explains the usage of iterators with TBB. A similar question I found here: http://software.intel.com/en-us/forums/topic/304696
"The execution time for this version is 0.452654 seconds which tops the serial std::for_each version and is what I expected. Is this the proposed way to use parallel_for with an STL vector?" That seems abysmally bad, unless you're running this on a single thread? You should get very good scalability for this simple kernel"
As far as I know, the default number of threads is equal to the number of cores, which is 4 in my case. But I figured out that the version written with the range concept seems to be faster. Execution time: 0.317561 seconds
[cpp]
tbb::parallel_for(
    tbb::blocked_range<size_t>(0, numbers.size()),
    [&] (const tbb::blocked_range<size_t> &range) {
        for (size_t i = range.begin(); i != range.end(); ++i) {
            numbers[i] += addition;
        }
    });
[/cpp]
"So having grainsize 1 and using the simple_partitioner disables the automatic chunking and spawns a thread for every iteration."
A "task", not a "thread" (essential difference...).
"I did the proposed optimizations:"
Oh yes, there's another level in the iterator. If you can get rid of that by iterating over pointers (vector now requires consecutive layout of elements), with boundary values evaluated at the start of the loop of course, perhaps that's what the compiler needs to work its magic.
"I couldn't find anything in the documentation that explains the usage of iterators with TBB. A similar question I found here: http://software.intel.com/en-us/forums/topic/304696"
I specifically meant the "hoisting" of end() from the serial loop. TBB team?
"But I figured out that the version written with the range concept seems to be faster."
You should get near-perfect scalability with end() out of the loop and using pointers. I wouldn't be satisfied with anything noticeably slower than about 0.13 seconds (the for_each timing divided by 4).
jimdempseyatthecove wrote:
Well, I really meant more specifically that, in TBB, in situations with random iterators, and reasoning also about their implementations, you can always expect to at least match parallel_for_each() performance with a properly configured parallel_for(). But I concede that this may be the wrong expectation where those assumptions break down. Overloading from external causes is an unsolved problem. Still, one should probably always bundle work into units (tasks) that are significantly larger than their scheduling overhead, so I'm not sure that this is a useful exploration of something not easily solved otherwise (by said bundling). As for atomic increments, for these to be faster than the specialised implementation with recursive parallelism for random iterators, doesn't the situation first have to degrade close to the point of only having one thread, or is my assumption of central sharing (or anything else) invalid?
Therefore, when work is relatively large compared to task management overhead, unpredictable thread scheduling is often best served by a non-deterministic task scheduling method such as parallel_for_each.
"Actually this difference is based on the implementation of the vector, I think."
Probably.
I know this is an old thread, but here are my results:
time spent (generate random): 4.61568 seconds
time spent (std::for_each): 0.671217 seconds
time spent (tbb::parallel_for_each): 715.845 seconds
time spent (tbb::parallel_for iterator): 203.174 seconds
time spent (tbb::parallel_for index): 0.234936 seconds
Intel Xeon E5620 @ 2.40 GHz, 16 GB RAM
parallel_for_each() supports input iterators or higher, and is implemented on top of parallel_do(), but has not been specialised for random-access iterators, with a more efficient implementation on top of parallel_for(). I think that the Reference Manual should at least have a warning about that, with the advice to use parallel_for where possible. A specialisation would still be beneficial for use in a template where the actual category is not known beforehand.
For parallel_for with an iterator, could you provide results for both debug and release?
(BTW, in the online Reference Manual, "Index_type last" should be "Index last", instead.)
With a few modifications (tried), parallel_for_each can be as fast as parallel_for (verified), given a random-access iterator of course. Any interest?
I'm not sure why we even need two different function templates. Is it to reassure ourselves that we're getting top scalability from parallel_for? That would be a good reason, I suppose. But I don't think we need any promise that parallel_for_each will always be (much) slower...
(2015-02-07 Added) parallel_do() already differentiates its implementation based on the category of the iterator arguments.