I know I must be doing something really dumb, but I have not been able to figure out what. I'm basically just doing the simple reduction example from the TBB book, but using the IPP sum routine (sum of all pixels in an Ipp16u plane). Sounds simple enough, but it looks like join() is not being called enough times. Do I ever need to call join() explicitly, or does the system always call it?
If I call operator() directly it works fine. As soon as I put it inside a parallel_reduce() I get the wrong (smaller) answer. Looking at the diagnostic prints in my code it looks like all sub-regions are computed correctly, but not all of them end up in join() calls.
Peter
[cpp]class Sum {
public:
    // Methods
    Sum(Img *img) : m_img(img), m_sum(0) {}

    void operator () (const tbb::blocked_range<int> &range) {
        IppiSize sz;
        Ipp16u *pSrc = (Ipp16u *)m_img->getPixel(0, range.begin(), 0);
        I32 step = m_img->getStep(0);
        sz.width  = m_img->getWidth(0);
        sz.height = range.size();
        if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &m_sum))
            throw std::runtime_error("ippiSum_16u_C1R failed!\n");
        printf("Sum for %d %d = %0.1f\n", range.begin(), range.end(), m_sum);
    }

    Sum(Sum &x, tbb::split) : m_img(x.m_img), m_sum(0) {}

    void join(const Sum &y) {
        printf("%0.1f = %0.1f + %0.1f\n", m_sum + y.m_sum, m_sum, y.m_sum);
        m_sum += y.m_sum;
    }

    F64 getSum() { return m_sum; }

private:
    // Attributes
    Img *m_img;
    F64 m_sum;
};[/cpp]
1 Solution
Quoting - pvonkaenel
Peter - you are losing some of the leaf ranges because the body object is reused for additional sub-ranges before any join. To fix the problem, change the following:
[cpp]if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &m_sum))
    throw std::runtime_error("ippiSum_16u_C1R failed!\n");[/cpp]
to:
[cpp]F64 tmpSum;
if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &tmpSum))
    throw std::runtime_error("ippiSum_16u_C1R failed!\n");
m_sum += tmpSum;[/cpp]
Don't feel bad. This may be one of the more obscure features of TBB.
And combining IPP and TBB may be your best idea. I am doing the same and seeing significant performance advantages.
Bob Davies
5 Replies
You are making a mistake if you expect the number of join() calls to equal the number of operator() calls. It actually equals the number of splitting-constructor calls, and body splitting is done lazily: a body is split only when adjacent ranges in the iteration space are processed by different threads.
So you need to accumulate partial sums in the body object, and merge those in join().
Quoting - Robert Davies (Intel)
Doh! That makes perfect sense, although I do not think I would have figured it out without your help.
I agree: mixing IPP with TBB has been working great for me. To make sure there are no conflicts, I have disabled the IPP threading layer (based on OpenMP) and rethreaded the routines I work with. Very powerful combination.
Thank you very much,
Peter
Peter - yes - turn off any OpenMP threading in the IPP libraries by calling ippSetNumThreads(1). OpenMP does not share TBB's thread dispatcher and may oversubscribe the cores.
Bob Davies