- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I've just begun exploring TBB.

I would like to ensure that parallel_for never subdivides my loop index by anything not evenly divisible by 4. I tried setting the grain size and using simple_partitioner with a multiple-of-four grainsize but still no luck.

For example:

for ( i = 0; i < 1000; i += 4 )

{

// do interesting processing of four floats ( at &float_array[ i ], [ i+1 ], [ i+2 ] and [ i+3 ] )

}

When converted to parallel, all sub-ranges must be such that:

1) range.begin % 4 == 0

2) range.end % 4 == 0 (obviously, it'll be my responsiblity to ensure the loop I begin with is conformant)

I must be misunderstanding the purpose of grain_size--can anyone post me a code snippet illustrating a best-practices implementation (optimized for execution speed) of the above problem?

I have a simple workaround--simply divide the initial range by 4 and scale all indices within the parallel loops by the same amount, but that is not optimal for performance.

Thanks in advance,

-Brad

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

*I have a simple workaround--simply divide the initial range by 4 and scale all indices within the parallel loops by the same amount, but that is not optimal for performance.*

Why do you think it is not?

int i = 4*range.begin();

int end = 4*range.end();

for ( ; i < end; i += 4 )

{

// do interesting processing of four floats ( at &float_array[ i ], [ i+1 ], [ i+2 ] and [ i+3 ] )

}

What is suboptimal? Two additional left shifts (to multiply by four) per range will be barely perceptible in microbenchmarks I believe; in real programs, the difference will be within statistical uncertainty.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

As for grainsize, it merely restricts division by 2 to ranges that are larger, so it will neither preserve alignment in this way nor guarantee that the final ranges are at least of size grainsize.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

1) Can you explain your grainsize statement a bit further? (My understanding is unclear.) Grainsize merely restricts division by 2 to ranges that are larger than what? My indicated size?

So if I am understanding this correctly, setting a grainsize gives me an *upper* bound on the size of a range--ie. no larger than x?

2) Anyone out there with suggestions on how to achieve an aligned parallel for?

Thanks, everyone,

-Brad

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

My goal, though, is to understand TBB, and I'm simply unsure of whether alignment of my parallel fors is possible or not. If it is not possible, I'll stick with the scaling technique I'm using now. But if it is, I'll understand TBB much better as I explore more complex problems.

Can you shed some light on this?

Thanks,

-Brad

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

*"So if I am understanding this correctly, setting a grainsize gives me an *upper* bound on the size of a range--ie. no larger than x?"*With simple_partitioner, that would be correct. The lower limit would then be half the grainsize, but it's reached only if grainsize is even, from grainsize+1, by one of the parts, I think (please correct me if I should have given this a little more thought), unless the range already started out smaller than that of course. The default partitioner is now auto_partitioner, and it wisely tries to process larger sizes than that, with still the same ultimate limit on divisibility.

*"Anyone out there with suggestions on how to achieve an aligned parallel for?"*Do you know for sure that this is important, with auto_partitioner and Amdahl involved?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

**Re: Partitioners:**

You lost me a bit at the following phrase: "from grainsize + 1, by one of the parts..." Before that, my understanding was while grainsize is the upper bound for simple_partitioner the upper bound is larger for auto_partitioner. The lower bound is grainsize / 2 when grainsize is even. Did I miss anything?

Re: Alignment

Re: Alignment

I don't know for sure how important this might be, but at this stage, that is no longer the point. I'd simply like to understand if it's possible/how to do it, as it would improve my understading of the partitioner and of how grainsize is used.

I'd still love to hear from someone on a solution for this.

Thanks,

-Brad

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

*"Did I miss anything?"*With auto_partitioner, the range is divided into at least something like a dozen times the default number of worker threads parts (if you can parse that sentence), with "a dozen times" used intentionally because it's not documented, so that would be the upper bound. The lower bound (if you start from a larger size) is always grainsize/2 of course, and if grainsize is uneven it is certainly exclusive, but otherwise it depends...

*"I'd simply like to understand if it's possible/how to do it, as it would improve my understading of the partitioner and of how grainsize is used."*There's no explicit support for that, but the workaround has been indicated above. Up to you to decide whether you want to go that route, depending on the application. If you think it'll make a difference, I'd be interested to see the evidence, but my intuition says that it probably won't do much for you.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

-Brad

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page