Hi,
I have IPP-based code which I have parallelized using OpenMP. After reading about TBB and its additional capabilities, I decided to try replacing some OpenMP with TBB as a test. I think I have set things up correctly, and I can see all 4 cores in my machine in use, but for some reason I cannot get the TBB version to run anywhere near as fast as the OpenMP version and don't understand why. My OpenMP version looks like the following:
[cpp]
#pragma omp parallel for private(srcPix, dstPix)
for (I32 i = 0; i < srcImg.getHeight(0); i += 2) {
    srcPix = static_cast<Ipp8u*>(srcImg.getPixel(0, i, 0));
    dstPix[0] = static_cast<Ipp8u*>(dstImg.getPixel(0, i, 0));
    dstPix[1] = static_cast<Ipp8u*>(dstImg.getPixel(1, i >> 1, 0));
    dstPix[2] = static_cast<Ipp8u*>(dstImg.getPixel(2, i >> 1, 0));
    ippiCbYCr422ToYCbCr420_8u_C2P3R(srcPix, srcStep, dstPix, dstStep, sz);
}
[/cpp]
The TBB version looks like the following:
[cpp]
class UYVYToI420_progressive {
public:
    IppiSize sz;
    I32 srcStep;
    I32 dstStep[3];
    UYVYImg *src;
    I420Img *dst;

    void operator()(const tbb::blocked_range<I32> &range) const {
        for (I32 i = range.begin(); i != range.end(); i++) {
            Ipp8u *srcPix = (Ipp8u*)src->getPixel(0, i*2, 0);
            Ipp8u *dstPix[3];
            dstPix[0] = (Ipp8u*)dst->getPixel(0, i*2, 0);
            dstPix[1] = (Ipp8u*)dst->getPixel(1, (i*2) >> 1, 0);
            dstPix[2] = (Ipp8u*)dst->getPixel(2, (i*2) >> 1, 0);
            ippiCbYCr422ToYCbCr420_8u_C2P3R(srcPix, srcStep, dstPix, (I32*)dstStep, sz);
        }
    }
};

// The parallel_for call looks like this
tbb::parallel_for(tbb::blocked_range<I32>(0, srcImg.getHeight(0) >> 1, 64), conv);
[/cpp]
I'm running this on a 1920x1080 image 1000 times and for some reason the TBB version is running about 4 times slower than the OpenMP version. I have tried various grain sizes and the auto_partitioner, but they all produce roughly the same results. Any ideas what I'm doing wrong?
Thanks,
Peter
I think I figured out my problem, which really indicates that my timing test is not very good: use of the affinity_partitioner seems to have fixed it. However, this brings me to another question. Is it valid, or even a good idea, to have an affinity_partitioner follow the data around, or should each parallel_for have its own? It seems to me like the partitioning should be linked to the data instead of the loop, correct?
Thanks,
Peter
Quoting - pvonkaenel
I think I figured out my problem, which really indicates that my timing test is not very good: use of the affinity_partitioner seems to have fixed it. However, this brings me to another question. Is it valid, or even a good idea, to have an affinity_partitioner follow the data around, or should each parallel_for have its own? It seems to me like the partitioning should be linked to the data instead of the loop, correct?
Thanks,
Peter
You are correct. Having one affinity_partitioner object passed to a series of parallel_for invocations is _the_ way to use it. This object accumulates information about the task-to-worker mapping from previous runs so it can repeat that mapping later. If each invocation used a temporary affinity_partitioner, it would behave basically the same as auto_partitioner and would not provide better cache locality.
Quoting - pvonkaenel
I think I figured out my problem, which really indicates that my timing test is not very good: use of the affinity_partitioner seems to have fixed it. However, this brings me to another question. Is it valid, or even a good idea, to have an affinity_partitioner follow the data around, or should each parallel_for have its own? It seems to me like the partitioning should be linked to the data instead of the loop, correct?
Thanks,
Peter
Just to make sure I understand the question. Are you talking about this:
[cpp]
tbb::affinity_partitioner part;
for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body(), part);
}
[/cpp]
or are you talking about this:
[cpp]
tbb::affinity_partitioner part1, part2;
for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body1(), part1);
    tbb::parallel_for(blocked_range<int>(0, K), body2(), part2);
}
[/cpp]
The first one is the classical situation where one would use an affinity_partitioner: save the mapping from tasks to thread ids during the first iteration and then re-use this knowledge for the following num_iter-1 iterations. The second one, however, is the one showing that the affinity_partitioner should really follow the data. Which one were you referring to?
Quoting - Anton Pegushin (Intel)
Just to make sure I understand the question. Are you talking about this:
[cpp]
tbb::affinity_partitioner part;
for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body(), part);
}
[/cpp]
or are you talking about this:
[cpp]
tbb::affinity_partitioner part1, part2;
for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body1(), part1);
    tbb::parallel_for(blocked_range<int>(0, K), body2(), part2);
}
[/cpp]
The first one is the classical situation where one would use an affinity_partitioner: save the mapping from tasks to thread ids during the first iteration and then re-use this knowledge for the following num_iter-1 iterations. The second one, however, is the one showing that the affinity_partitioner should really follow the data. Which one were you referring to?
Actually, I'm talking about a hybrid between the two you list. If I have a block of data that gets processed by two separate parallel_for loops, can I have them share the partition like in the following:
[cpp]
tbb::affinity_partitioner part;
for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body1(), part);
    // Maybe do some serial work here
    tbb::parallel_for(blocked_range<int>(0, M), body2(), part);
}
[/cpp]
In this case, is it important to have the same blocked_range size, or can it adapt? The key to this is that both loops are operating on the same data. Does this work, or will the two loops fight each other?
Peter
Quoting - pvonkaenel
Actually, I'm talking about a hybrid between the two you list. If I have a block of data that gets processed by two separate parallel_for loops, can I have them share the partition like in the following:
[cpp]
tbb::affinity_partitioner part;
for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body1(), part);
    // Maybe do some serial work here
    tbb::parallel_for(blocked_range<int>(0, M), body2(), part);
}
[/cpp]
In this case, is it important to have the same blocked_range size, or can it adapt? The key to this is that both loops are operating on the same data. Does this work, or will the two loops fight each other?
Peter
But this is theory, I guess. I'd go ahead and try comparing timings for your scenario (an outer for-loop nesting two parallel_for's) when it uses (a) an auto_partitioner; (b) one affinity_partitioner; (c) two affinity_partitioners. Trying the above on several different platforms (4-core vs. 8-core, for instance) would give you an idea of how noticeable the effect of data locality is, and whether load balancing gets in the way of affinity partitioning.
Quoting - pvonkaenel
Actually, I'm talking about a hybrid between the two you list. If I have a block of data that gets processed by two separate parallel_for loops, can I have them share the partition like in the following
While Anton's verbose answer surely makes sense, let me also give you a shorter one :)
If both parallel loops would benefit from re-using the data hot in cache after the previous loop, then yes, use the same affinity_partitioner object. E.g., the Seismic example in the TBB packages does exactly that.
Quoting - pvonkaenel
In this case is it important to have the same blocked range size, or can it adapt?
Thank you, Anton and Alexey. I'll definitely experiment as I develop. It's nice to know that so many options are available.
Peter