Intel® oneAPI Threading Building Blocks

Using parallel_for with multiple large datasets


I'm trying to use tbb::parallel_for for an image denoising application. My problem is with passing the necessary data structures into the parallel loop class.
For efficiency reasons, all the large data structures are C-style arrays, e.g.
[cpp]float* imageData;
Position* imagePos;
float** neighbourhoods;
bool* imageBorder;
int maxBlockSizeSqr, maxWindowSize;
float* gaussKern;
float noisyImageDeviation, filterParam;
Image* outImage;[/cpp]

while the rest of the data are of basic types (float, int, ...). Sizes of structures depend on input image parameters, so they can range from a few bytes to dozens of megabytes.

I tried passing them by reference through the constructor into member variables, as well as using a local object in the anonymous namespace to hold the data until it is read by (or copied into) the loop object. Both methods were very slow, much slower than not using TBB at all. I briefly checked the available documents (Getting Started, Tutorials, Design Patterns) but could not find useful information on passing large amounts of data.

Would someone please point out where to look for information on this issue? Most likely I have missed something important while going through the documents. Also, what are the "best practices" for this (at least by name, so I can look them up)?

Thanks in advance,

PS: Sorry if this is the wrong forum, I only saw this one forum for TBB.
2 Replies
Compilers tend to do much better with non-address-taken local variables and formal parameters than with structure fields. The reason is that the compiler can analyze such variables much more precisely than it can analyze address-taken variables or fields, though advanced compilers can sometimes optimize address-taken variables if they can track every place the address might flow to.

So what I would do inside the functor is load all the values into local variables before executing the serial for loop over a subrange. Below is a sketch of how to go about this. The constructor for the loop body captures the pointer imageData in a member m_imageData. Then operator() loads the member back into a local pointer.

[cpp]#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

struct body {
    float* m_imageData;
    body( float* imageData ) : m_imageData(imageData) {}
    void operator()( const tbb::blocked_range<int>& r ) const;
};

void body::operator()( const tbb::blocked_range<int>& r ) const {
    // Load pointer into local temporary
    float* imageData = m_imageData;
    int end = r.end();
    for( int i=r.begin(); i!=end; ++i ) {
        // ... process imageData[i] ...
    }
}

void callsite( int n, float* imageData ) {
    tbb::parallel_for( tbb::blocked_range<int>(0,n), body(imageData) );
}[/cpp]
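
The same idea extends to several arrays and scalar parameters. The following is only a minimal sketch under stated assumptions, not the original poster's code: `ScaleBody`, `scale_image`, the `gain` parameter and the per-pixel operation are hypothetical placeholders, and `IndexRange` is a hypothetical stand-in for `tbb::blocked_range<int>` so that the sketch compiles without TBB. The point it tries to show is that the body holds only pointers and scalars, so copying it stays cheap regardless of how large the arrays are, and that every member is loaded into a local before the hot loop.

```cpp
#include <cassert>

// Hypothetical stand-in for tbb::blocked_range<int> (assumption, so the
// sketch builds without TBB). It only mimics begin()/end().
struct IndexRange {
    int first, last;
    int begin() const { return first; }
    int end() const { return last; }
};

// Loop body holding only pointers and scalars. Copying it is cheap no
// matter how many megabytes sit behind the pointers.
struct ScaleBody {
    const float* m_in;    // input pixels (not owned)
    float*       m_out;   // output pixels (not owned)
    float        m_gain;  // hypothetical scalar parameter

    ScaleBody(const float* in, float* out, float gain)
        : m_in(in), m_out(out), m_gain(gain) {}

    // Templated on the range type so the same body accepts the stand-in
    // above or a tbb::blocked_range<int>.
    template <typename Range>
    void operator()(const Range& r) const {
        // Load all members into locals once, before the serial loop.
        const float* in = m_in;
        float* out = m_out;
        const float gain = m_gain;
        const int end = r.end();
        for (int i = r.begin(); i != end; ++i) {
            out[i] = gain * in[i];  // placeholder for the real per-pixel work
        }
    }
};

// Serial driver for illustration. With TBB the call would instead be:
//   tbb::parallel_for(tbb::blocked_range<int>(0, n), ScaleBody(in, out, gain));
void scale_image(int n, const float* in, float* out, float gain) {
    assert(n >= 0);
    ScaleBody body(in, out, gain);
    body(IndexRange{0, n});
}
```

With many inputs it may be tidier to group the pointers in one small struct and give the body a single pointer to it, but the member-to-local step inside operator() stays the same.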

You might also consider something that does not require encapsulation objects and operator() member functions, such as Cilk++ (available in the new Parallel Studio), the old stalwart OpenMP, or QuickThread (my little project).
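
For comparison, here is a rough sketch of what the OpenMP variant of such a loop could look like. The function name `scale_image_omp`, the `gain` parameter and the per-pixel operation are hypothetical placeholders, not taken from the original question. It assumes the compiler has OpenMP enabled (for example `-fopenmp` with GCC or Clang); without that the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>

// Plain function with an OpenMP work-sharing loop; no functor class needed.
// The pointers and the scalar are ordinary formal parameters, which are
// shared by all threads inside the parallel region.
void scale_image_omp(int n, const float* in, float* out, float gain) {
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        out[i] = gain * in[i];  // placeholder for the real per-pixel work
    }
}
```

Each iteration writes a distinct `out[i]`, so no synchronization is needed in this particular sketch.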

A rework of Arch's example using QuickThread:

[cpp]void doWork( int iBegin, int iEnd, float* imageData) {
    for( int i = iBegin; i < iEnd; ++i) {
        // ... process imageData[i] ...
    }
}

void callsite( int n, float* imageData ) {
    qt::parallel_for( doWork, 0, n, imageData);
}[/cpp]
Jim Dempsey