## parallel_reduce problem

New Contributor III
329 Views
I know I must be doing something really dumb, but I have not been able to figure out what. I'm basically just doing the simple reduction example in the TBB book, but using the IPP sum routine (sum of all pixels in an Ipp16u plane). Sounds simple enough, but it looks like join() is not being called enough times. Do I ever need to call join() explicitly, or does the system always call it?

[cpp]class Sum
{
public:  // Methods
    Sum(Img *img) : m_img(img), m_sum(0) {}
    void operator () (const tbb::blocked_range<int> &range)
    {
        IppiSize sz;
        Ipp16u *pSrc = (Ipp16u*)m_img->getPixel(0, range.begin(), 0);
        I32 step = m_img->getStep(0);
        sz.width = m_img->getWidth(0);
        sz.height = range.size();

        if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &m_sum))
            throw std::runtime_error("ippiSum_16u_C1R failed!\n");
        printf("Sum for %d %d = %0.1f\n", range.begin(), range.end(), m_sum);
    }
    Sum(Sum &x, tbb::split) : m_img(x.m_img), m_sum(0) {}
    void join(const Sum &y) { printf("%0.1f = %0.1f + %0.1f\n", m_sum + y.m_sum, m_sum, y.m_sum); m_sum += y.m_sum; }
    F64 getSum() { return m_sum; }

private: // Attributes
    Img *m_img;
    F64 m_sum;
};
[/cpp]

If I call operator() directly it works fine. As soon as I put it inside a parallel_reduce() I get the wrong (smaller) answer. Looking at the diagnostic prints in my code it looks like all sub-regions are computed correctly, but not all of them end up in join() calls.

Peter

1 Solution
Employee
329 Views
Peter - you are leaving out some of the leaf nodes because the body object is being reused for more than one subrange before a join. ippiSum_16u_C1R writes its result straight into m_sum, so each call overwrites the sum from the previous subrange. To fix the problem, change the following:

if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &m_sum))
    throw std::runtime_error("ippiSum_16u_C1R failed!\n");

to:

F64 tmpSum;
if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &tmpSum))
    throw std::runtime_error("ippiSum_16u_C1R failed!\n");
m_sum += tmpSum;

Don't feel bad. This may be one of the more obscure features of TBB.

And combining IPP and TBB may be your best idea. I am doing the same and seeing significant performance advantages.

Bob Davies
5 Replies
Employee
329 Views
It is a mistake to expect the number of join() calls to equal the number of operator() calls. It actually equals the number of splitting-constructor calls, and body splitting is done lazily, only when adjacent ranges in the iteration space are processed by different threads.
So you need to accumulate the partial sums in the body object and merge them in join().
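To make the body-reuse point concrete, here is a minimal sketch that does not use TBB or IPP at all. `sumRows` is a hypothetical stand-in for an output-parameter routine like ippiSum_16u_C1R, and `reduceWithReuse` is a hand-written schedule (an assumption for illustration, not what the TBB scheduler necessarily does) in which one body processes two adjacent subranges without a split, and a second body handles the rest and is then joined.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an IPP-style routine: writes the sum of
// rows [begin, end) into *out, overwriting whatever *out held before.
inline void sumRows(const std::vector<double>& rows, std::size_t begin,
                    std::size_t end, double* out) {
    *out = std::accumulate(rows.begin() + begin, rows.begin() + end, 0.0);
}

// Body that lets the routine write directly into its running total.
// A second call on the same body discards the first call's result.
struct OverwritingSum {
    const std::vector<double>* rows;
    double sum;
    explicit OverwritingSum(const std::vector<double>* r) : rows(r), sum(0.0) {}
    void operator()(std::size_t begin, std::size_t end) {
        sumRows(*rows, begin, end, &sum);
    }
    void join(const OverwritingSum& rhs) { sum += rhs.sum; }
};

// Body that sums into a temporary and accumulates, so it stays correct
// no matter how many subranges it is handed.
struct AccumulatingSum {
    const std::vector<double>* rows;
    double sum;
    explicit AccumulatingSum(const std::vector<double>* r) : rows(r), sum(0.0) {}
    void operator()(std::size_t begin, std::size_t end) {
        double tmp = 0.0;
        sumRows(*rows, begin, end, &tmp);
        sum += tmp;
    }
    void join(const AccumulatingSum& rhs) { sum += rhs.sum; }
};

// Hand-written schedule (illustrative only): three subranges, two bodies,
// one join. The left body is reused for two adjacent subranges.
template <typename Body>
double reduceWithReuse(const std::vector<double>& rows) {
    const std::size_t n = rows.size();
    Body left(&rows);
    left(0, n / 3);          // first subrange
    left(n / 3, 2 * n / 3);  // same body again: no split, hence no join
    Body right(&rows);       // plays the role of the splitting constructor
    right(2 * n / 3, n);
    left.join(right);        // the only join in this schedule
    return left.sum;
}
```

With rows holding 1 through 6, `reduceWithReuse<OverwritingSum>` returns 18 because the first subrange's partial sum of 3 is lost, while `reduceWithReuse<AccumulatingSum>` returns the full 21. That is the same "smaller answer" symptom described in the question: three operator() calls, but only one join().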
New Contributor III
329 Views

Doh! That makes perfect sense, although I do not think I would have figured it out without your help.

I agree: mixing IPP with TBB has been working great for me. To make sure there are no conflicts, I have disabled the IPP threading layer (based on OpenMP) and rethreaded the routines I work with. Very powerful combination.

Thank you very much,
Peter
Employee
329 Views

Peter - yes - turn off any OpenMP threading in the IPP libraries by setting the thread count to 1 with ippSetNumThreads(1). OpenMP does not share TBB's thread dispatcher, so running both at once may oversubscribe the machine.

Bob Davies
Employee
329 Views
Thanks Bob. I was trying to explain the same issue, but you did it much better and right to the point :)