Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

parallel_reduce problem

pvonkaenel
New Contributor III
I know I must be doing something really dumb, but I have not been able to figure out what. I'm basically just doing the simple reduction example in the TBB book, but using the IPP sum routine (sum of all pixels in an Ipp16u plane). Sounds simple enough, but it looks like join() is not being called enough times. Do I ever need to call join() explicitly, or does the system always call it?

[cpp]class Sum
{
public:  // Methods
    Sum(Img *img) : m_img(img), m_sum(0) {}
    void operator () (const tbb::blocked_range<int> &range)
    {
        IppiSize sz;
        Ipp16u *pSrc = (Ipp16u*)m_img->getPixel(0, range.begin(), 0);
        I32 step = m_img->getStep(0);
        sz.width = m_img->getWidth(0);
        sz.height = range.size();

        if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &m_sum))
            throw std::runtime_error("ippiSum_16u_C1R failed!\n");
        printf("Sum for %d %d = %0.1f\n", range.begin(), range.end(), m_sum);
    }
    Sum(Sum &x, tbb::split) : m_img(x.m_img), m_sum(0) {}
    void join(const Sum &y) { printf("%0.1f = %0.1f + %0.1f\n", m_sum + y.m_sum, m_sum, y.m_sum); m_sum += y.m_sum; }
    F64 getSum() { return m_sum; }

private: // Attributes
    Img *m_img;
    F64 m_sum;
};
[/cpp]


If I call operator() directly it works fine. As soon as I put it inside a parallel_reduce() I get the wrong (smaller) answer. Looking at the diagnostic prints in my code it looks like all sub-regions are computed correctly, but not all of them end up in join() calls.

Peter




1 Solution
ROBERT_D_Intel1
Employee
Peter - you are leaving out some of the leaf nodes because the same body object is being reused for several subranges before a join, and each operator() call overwrites m_sum with only the latest subrange's sum. To fix the problem, change the following:

    if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &m_sum))
        throw std::runtime_error("ippiSum_16u_C1R failed!\n");

to:

    F64 tmpSum;
    if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &tmpSum))
        throw std::runtime_error("ippiSum_16u_C1R failed!\n");
    m_sum += tmpSum;
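
To see the effect in isolation, here is a minimal sketch that needs neither TBB nor IPP. It is only a model under stated assumptions: `sumRows` is a hypothetical stand-in for `ippiSum_16u_C1R` that overwrites its output argument, and `runOnOneBody` is a hypothetical driver that feeds consecutive chunks to a single body with no split or join in between, which is the situation described above. None of these names come from TBB or IPP.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the IPP call: sums rows [begin, end) and
// overwrites *out with the result (it does not add to the old value).
void sumRows(const std::vector<double>& rows, std::size_t begin,
             std::size_t end, double* out) {
    double s = 0.0;
    for (std::size_t i = begin; i < end; ++i) s += rows[i];
    *out = s;
}

// Body in the style of the original post: every call overwrites m_sum.
struct OverwritingSum {
    const std::vector<double>* rows;
    double m_sum;
    explicit OverwritingSum(const std::vector<double>& r) : rows(&r), m_sum(0.0) {}
    void operator()(std::size_t begin, std::size_t end) {
        sumRows(*rows, begin, end, &m_sum);
    }
};

// Body with the suggested fix: sum into a temporary, then accumulate.
struct AccumulatingSum {
    const std::vector<double>* rows;
    double m_sum;
    explicit AccumulatingSum(const std::vector<double>& r) : rows(&r), m_sum(0.0) {}
    void operator()(std::size_t begin, std::size_t end) {
        double tmpSum = 0.0;
        sumRows(*rows, begin, end, &tmpSum);
        m_sum += tmpSum;
    }
};

// Hypothetical driver: applies ONE body to consecutive chunks, the way a
// single thread may process adjacent subranges without any split or join.
template <typename Body>
double runOnOneBody(const std::vector<double>& rows, std::size_t chunk) {
    Body body(rows);
    for (std::size_t b = 0; b < rows.size(); b += chunk) {
        std::size_t e = std::min(b + chunk, rows.size());
        body(b, e);
    }
    return body.m_sum;
}
```

With rows `{1, 2, 3, 4, 5, 6}` and a chunk size of 2, the accumulating body ends up holding the full total, while the overwriting body keeps only the sum of the last chunk. That matches the "smaller answer" symptom in the question.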

Don't feel bad. This may be one of the more obscure features of TBB.

And combining IPP and TBB may be your best idea. I am doing the same and seeing significant performance advantages.

Bob Davies

5 Replies
Alexey-Kukanov
Employee
It is a mistake to expect the number of join() calls to equal the number of operator() calls. It actually equals the number of splitting constructor calls, and body splitting is done lazily, only when adjacent ranges in the iteration space are processed by different threads.
So you need to accumulate partial "sums" in the body object across operator() calls, and merge those in join().
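
The relationship between the call counts can be illustrated with a small sequential model. This is not TBB's scheduler, only a sketch under assumptions: `SplitTag`, `Counters`, `reduce`, and the `splitDepth` parameter are hypothetical names, and "split only above a certain recursion depth" is a crude stand-in for "split only when another thread takes the adjacent range".

```cpp
#include <cstddef>
#include <vector>

struct SplitTag {};  // hypothetical stand-in for tbb::split

struct Counters {
    int calls;   // operator() invocations
    int splits;  // splitting constructor invocations
    int joins;   // join() invocations
    Counters() : calls(0), splits(0), joins(0) {}
};

struct Sum {
    const std::vector<int>* data;
    Counters* stats;
    long m_sum;

    Sum(const std::vector<int>& d, Counters& c) : data(&d), stats(&c), m_sum(0) {}

    // Splitting constructor: the new body starts with an empty partial sum.
    Sum(Sum& other, SplitTag) : data(other.data), stats(other.stats), m_sum(0) {
        ++stats->splits;
    }

    // May be called several times on the same body, so it accumulates.
    void operator()(std::size_t begin, std::size_t end) {
        ++stats->calls;
        long tmp = 0;
        for (std::size_t i = begin; i < end; ++i) tmp += (*data)[i];
        m_sum += tmp;
    }

    // Merges the partial sum of a body that was split off earlier.
    void join(const Sum& rhs) {
        ++stats->joins;
        m_sum += rhs.m_sum;
    }
};

// Halves [begin, end) until a piece has at most `grain` elements. The body is
// split only at depths below `splitDepth`; deeper in the recursion the same
// body processes both halves, so no split and no join happen there.
void reduce(Sum& body, std::size_t begin, std::size_t end,
            std::size_t grain, int depth, int splitDepth) {
    if (end - begin <= grain) {
        body(begin, end);
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    if (depth < splitDepth) {
        Sum right(body, SplitTag());
        reduce(body, begin, mid, grain, depth + 1, splitDepth);
        reduce(right, mid, end, grain, depth + 1, splitDepth);
        body.join(right);
    } else {
        reduce(body, begin, mid, grain, depth + 1, splitDepth);
        reduce(body, mid, end, grain, depth + 1, splitDepth);
    }
}
```

In this model, reducing 16 elements with a grain of 2 and `splitDepth` set to 1 makes eight operator() calls but only one split and one join. The total is still correct because each body accumulates across its own calls.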
pvonkaenel
New Contributor III
Doh! That makes perfect sense, although I do not think I would have figured it out without your help.

I agree: mixing IPP with TBB has been working great for me. To make sure there are no conflicts, I have disabled the IPP threading layer (based on OpenMP) and rethreaded the routines I work with. Very powerful combination.

Thank you very much,
Peter
ROBERT_D_Intel1
Employee
Peter - yes - turn off any OpenMP threading in the IPP libraries by setting the number of threads to 1 (ippSetNumThreads(1)). OpenMP does not share TBB's thread dispatcher, so running both at once may oversubscribe the machine.

Bob Davies
Alexey-Kukanov
Employee
Thanks, Bob. I was indeed trying to explain the same issue, but you did it much better and right to the point :)