Intel® oneAPI Threading Building Blocks

Slower than OpenMP case

pvonkaenel
New Contributor III
Hi,

I have IPP-based code which I have parallelized using OpenMP. After reading about TBB and its additional capabilities, I decided to try replacing some of the OpenMP with TBB as a test. I think I have set things up correctly, and I can see all 4 cores of my machine in use, but for some reason I cannot get the TBB version to run anywhere near as fast as the OpenMP version, and I don't understand why. My OpenMP version looks like the following:


[cpp]        #pragma omp parallel for private(srcPix, dstPix)
        for (I32 i = 0; i < srcImg.getHeight(0); i+=2) {
            srcPix = static_cast<Ipp8u*>(srcImg.getPixel(0, i, 0));
            dstPix[0] = static_cast<Ipp8u*>(dstImg.getPixel(0, i, 0));
            dstPix[1] = static_cast<Ipp8u*>(dstImg.getPixel(1, i>>1, 0));
            dstPix[2] = static_cast<Ipp8u*>(dstImg.getPixel(2, i>>1, 0));
            ippiCbYCr422ToYCbCr420_8u_C2P3R(srcPix, srcStep, dstPix, dstStep, sz);
        }
[/cpp]


The TBB version looks like the following:

[cpp]class UYVYToI420_progressive
{
public:
    IppiSize sz;
    I32 srcStep;
    I32 dstStep[3];
    UYVYImg *src;
    I420Img *dst;

    void operator() (const tbb::blocked_range<I32> &range) const
    {
        for (I32 i = range.begin(); i != range.end(); i++) {
            Ipp8u *srcPix = (Ipp8u*)src->getPixel(0, i*2, 0);
            Ipp8u *dstPix[3];
            dstPix[0] = (Ipp8u*)dst->getPixel(0, i*2, 0);
            dstPix[1] = (Ipp8u*)dst->getPixel(1, (i*2)>> 1, 0);
            dstPix[2] = (Ipp8u*)dst->getPixel(2, (i*2)>> 1, 0);
            ippiCbYCr422ToYCbCr420_8u_C2P3R(srcPix, srcStep, dstPix, (I32*)dstStep, sz);
        }
    }
};


// The parallel for call looks like this
        tbb::parallel_for(tbb::blocked_range<I32>(0, srcImg.getHeight(0)>>1, 64), conv);
[/cpp]



I'm running this on a 1920x1080 image 1000 times, and for some reason the TBB version runs about 4 times slower than the OpenMP version. I have tried various grain sizes and the auto_partitioner, but they all produce roughly the same results. Any ideas what I'm doing wrong?

Thanks,
Peter
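
For reference, the two invocation variants described above would look something like this (a sketch reusing `conv`, `srcImg`, and `I32` from the code above; the grain size of 64 is just the value already used there):

[cpp]// Explicit grain size of 64, as in the call above:
tbb::parallel_for(tbb::blocked_range<I32>(0, srcImg.getHeight(0)>>1, 64), conv);

// Letting the auto_partitioner pick chunk sizes instead:
tbb::parallel_for(tbb::blocked_range<I32>(0, srcImg.getHeight(0)>>1),
                  conv, tbb::auto_partitioner());[/cpp]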
7 Replies
pvonkaenel
New Contributor III

I think I figured out my problem, which really indicates that my timing test is not very good: use of the affinity_partitioner seems to have fixed it. However, this brings me to another question. Is it valid, or even a good idea, to have an affinity_partitioner follow the data around, or should each parallel_for have its own? It seems to me that the partitioning should be linked to the data instead of the loop, correct?

Thanks,
Peter
Alexey-Kukanov
Employee
Quoting - pvonkaenel

I think I figured out my problem, which really indicates that my timing test is not very good: use of the affinity_partitioner seems to have fixed it. However, this brings me to another question. Is it valid, or even a good idea, to have an affinity_partitioner follow the data around, or should each parallel_for have its own? It seems to me that the partitioning should be linked to the data instead of the loop, correct?

Thanks,
Peter

You are correct. Having one affinity_partitioner object passed to a series of parallel_for invocations is _the_ way to use it. The object accumulates information about the task-to-worker mapping from previous runs in order to repeat it later. If each invocation used a temporary affinity_partitioner, it would behave basically the same as auto_partitioner and would not provide better cache locality.
Anton_Pegushin
New Contributor II
Quoting - pvonkaenel

I think I figured out my problem, which really indicates that my timing test is not very good: use of the affinity_partitioner seems to have fixed it. However, this brings me to another question. Is it valid, or even a good idea, to have an affinity_partitioner follow the data around, or should each parallel_for have its own? It seems to me that the partitioning should be linked to the data instead of the loop, correct?

Thanks,
Peter
Just to make sure I understand the question. Are you talking about this:
[cpp]tbb::affinity_partitioner part;

for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body(), part);
}[/cpp]
or are you talking about this:
[cpp]tbb::affinity_partitioner part1, part2;

for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body1(), part1);
    tbb::parallel_for(blocked_range<int>(0, K), body2(), part2);
}[/cpp]
The first one is the classical situation where one would use an affinity_partitioner: save the mapping from tasks to thread IDs during the first iteration, then reuse that knowledge for the following num_iter-1 iterations. The second one, however, shows that the affinity_partitioner should really follow the data. Which one were you referring to?
pvonkaenel
New Contributor III
Quoting - Anton_Pegushin
Just to make sure I understand the question. Are you talking about this:
[cpp]tbb::affinity_partitioner part;

for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body(), part);
}[/cpp]
or are you talking about this:
[cpp]tbb::affinity_partitioner part1, part2;

for (int i = 0; i < num_iter; ++i) {
    tbb::parallel_for(blocked_range<int>(0, M), body1(), part1);
    tbb::parallel_for(blocked_range<int>(0, K), body2(), part2);
}[/cpp]
The first one is the classical situation where one would use an affinity_partitioner: save the mapping from tasks to thread IDs during the first iteration, then reuse that knowledge for the following num_iter-1 iterations. The second one, however, shows that the affinity_partitioner should really follow the data. Which one were you referring to?

Actually, I'm talking about a hybrid of the two you listed. If I have a block of data that gets processed by two separate parallel_for loops, can I have them share the partitioner, like in the following:

[cpp]    tbb::affinity_partitioner part;

    for (int i = 0; i < num_iter; ++i) {
        tbb::parallel_for(blocked_range<int>(0, M), body1(), part);
        // Maybe do some serial work here
        tbb::parallel_for(blocked_range<int>(0, M), body2(), part);
    }
[/cpp]

In this case, is it important to have the same blocked_range size, or can it adapt? The key point is that both loops operate on the same data. Does this work, or will the two loops fight each other?

Peter
Anton_Pegushin
New Contributor II
Quoting - pvonkaenel

Actually, I'm talking about a hybrid of the two you listed. If I have a block of data that gets processed by two separate parallel_for loops, can I have them share the partitioner, like in the following:

[cpp]    tbb::affinity_partitioner part;

    for (int i = 0; i < num_iter; ++i) {
        tbb::parallel_for(blocked_range<int>(0, M), body1(), part);
        // Maybe do some serial work here
        tbb::parallel_for(blocked_range<int>(0, M), body2(), part);
    }
[/cpp]

In this case, is it important to have the same blocked_range size, or can it adapt? The key point is that both loops operate on the same data. Does this work, or will the two loops fight each other?

Peter
As I understand it, there is a catch. Yes, the affinity_partitioner creates and uses a mapping from tasks to thread IDs to preserve data locality, _but_ load balancing among threads is still the first priority. Consider a situation with only one parallel_for nested inside a for loop. The affinity_partitioner's mapping is created during the first iteration, but it is not carved in stone: if on the second iteration the threads need to balance the load differently (namely, stealing goes a bit differently), the mapping will be updated, and the newer version will be used during the third iteration. Now, two parallel_for loops will have twice as much load balancing to do, and this _might_ mess up the affinity_partitioner's mapping that they share. If the two parallel_for bodies are differently non-uniform (the key word being "differently"), each of them will try to reuse the mapping left by the previous call (the other parallel_for), but will end up with a different stealing scheme and will update (read: mess up) the mapping for the other one. And that will happen on every iteration of the enclosing for loop...
But this is theory, I guess. I'd go ahead and compare timings for your scenario (an enclosing for loop with two parallel_for calls) when it uses (a) an auto_partitioner; (b) one shared affinity_partitioner; (c) two affinity_partitioners; a sketch of such a harness follows below. Trying this on several different platforms (4-core vs. 8-core, for instance) would give you an idea of how noticeable the effect of data locality is, and whether load balancing gets in the way of affinity partitioning.
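
A minimal sketch of that three-way experiment, assuming the `body1`, `body2`, `M`, and `num_iter` placeholders from the snippets above (tbb::tick_count is TBB's timing facility):

[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"
#include "tbb/tick_count.h"

// (a) auto_partitioner for both loops: no affinity mapping is kept.
void run_auto(int num_iter, int M) {
    for (int i = 0; i < num_iter; ++i) {
        tbb::parallel_for(tbb::blocked_range<int>(0, M), body1(), tbb::auto_partitioner());
        tbb::parallel_for(tbb::blocked_range<int>(0, M), body2(), tbb::auto_partitioner());
    }
}

// (b) one affinity_partitioner shared by both loops: the mapping follows the data.
void run_shared(int num_iter, int M) {
    tbb::affinity_partitioner part;
    for (int i = 0; i < num_iter; ++i) {
        tbb::parallel_for(tbb::blocked_range<int>(0, M), body1(), part);
        tbb::parallel_for(tbb::blocked_range<int>(0, M), body2(), part);
    }
}

// (c) one affinity_partitioner per loop: each loop keeps its own mapping.
void run_separate(int num_iter, int M) {
    tbb::affinity_partitioner part1, part2;
    for (int i = 0; i < num_iter; ++i) {
        tbb::parallel_for(tbb::blocked_range<int>(0, M), body1(), part1);
        tbb::parallel_for(tbb::blocked_range<int>(0, M), body2(), part2);
    }
}

// Timing one variant:
//   tbb::tick_count t0 = tbb::tick_count::now();
//   run_shared(num_iter, M);
//   double seconds = (tbb::tick_count::now() - t0).seconds();[/cpp]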
Alexey-Kukanov
Employee
Quoting - pvonkaenel
Actually, I'm talking about a hybrid of the two you listed. If I have a block of data that gets processed by two separate parallel_for loops, can I have them share the partitioner, like in the following

While Anton's verbose answer surely makes sense, let me also give you a shorter one :)

If both parallel loops would benefit from reusing the data left hot in cache by the previous loop, then yes, use the same affinity_partitioner object. For example, the Seismic example in the TBB packages does exactly that.

Quoting - pvonkaenel
In this case, is it important to have the same blocked_range size, or can it adapt?
It is not necessary for the iteration space size to be exactly the same; the partitioner will adapt. Be aware, however, that task affinity is defined by the task's position in the binary tree of recursive work splitting, not by the range size of the leaf task.
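
For illustration, here is a sketch of that situation: two loops of different sizes over the same data sharing one partitioner (`body1`, `body2`, and `M` are placeholders as in the earlier snippets).

[cpp]tbb::affinity_partitioner part;

// The two iteration spaces need not match; the partitioner adapts.
// Affinity is keyed to a task's position in the recursive splitting
// tree, so loops over the same data still tend to be scheduled onto
// the same worker threads.
tbb::parallel_for(tbb::blocked_range<int>(0, M),   body1(), part);
tbb::parallel_for(tbb::blocked_range<int>(0, M/2), body2(), part);[/cpp]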
pvonkaenel
New Contributor III

Thank you Anton and Alexey. I'll definitely experiment as I develop. It's nice to know that so many options are available.

Peter