parallel_for alignment question

bradleygibson · ‎06-12-2010

Hi, everyone,

I've just begun exploring TBB.

I would like to ensure that parallel_for never subdivides my loop index by anything not evenly divisible by 4. I tried setting the grain size and using simple_partitioner with a multiple-of-four grainsize but still no luck.

For example:

for ( i = 0; i < 1000; i += 4 )
{
// do interesting processing of four floats ( at &float_array[ i ], [ i+1 ], [ i+2 ] and [ i+3 ] )
}

When converted to parallel, all sub-ranges must be such that:
1) range.begin % 4 == 0
2) range.end % 4 == 0 (obviously, it'll be my responsiblity to ensure the loop I begin with is conformant)

I must be misunderstanding the purpose of grain_size--can anyone post me a code snippet illustrating a best-practices implementation (optimized for execution speed) of the above problem?

I have a simple workaround--simply divide the initial range by 4 and scale all indices within the parallel loops by the same amount, but that is not optimal for performance.

Thanks in advance,
-Brad

Alexey-Kukanov · ‎06-12-2010

Quoting bradleygibson

I have a simple workaround--simply divide the initial range by 4 and scale all indices within the parallel loops by the same amount, but that is not optimal for performance.

Why do you think it is not?

int i = 4*range.begin();
int end = 4*range.end();
for ( ; i < end; i += 4 )
{
// do interesting processing of four floats ( at &float_array[ i ], [ i+1 ], [ i+2 ] and [ i+3 ] )
}
What is suboptimal? Two additional left shifts (to multiply by four) per range will be barely perceptible in microbenchmarks I believe; in real programs, the difference will be within statistical uncertainty.

RafSchietekat · ‎06-13-2010

This looks like premature optimisation. I don't know for sure whether an optimising compiler would always come up with the technique indicated by Alexey in #1, but probably either it doesn't matter a lot anyway or the range is large enough that the alignment of the subranges is of little concern for performance with auto_partitioner now the default.

As for grainsize, it merely restricts division by 2 to ranges that are larger, so it will neither preserve alignment in this way nor guarantee that the final ranges are at least of size grainsize.

bradleygibson · ‎06-13-2010

Hi, Raf,

1) Can you explain your grainsize statement a bit further? (My understanding is unclear.) Grainsize merely restricts division by 2 to ranges that are larger than what? My indicated size?

So if I am understanding this correctly, setting a grainsize gives me an *upper* bound on the size of a range--ie. no larger than x?

2) Anyone out there with suggestions on how to achieve an aligned parallel for?

Thanks, everyone,
-Brad

bradleygibson · ‎06-13-2010

Alexey, I agree--the perf delta is likely negligible.

My goal, though, is to understand TBB, and I'm simply unsure of whether alignment of my parallel fors is possible or not. If it is not possible, I'll stick with the scaling technique I'm using now. But if it is, I'll understand TBB much better as I explore more complex problems.

Can you shed some light on this?

Thanks,
-Brad

RafSchietekat · ‎06-14-2010

"So if I am understanding this correctly, setting a grainsize gives me an *upper* bound on the size of a range--ie. no larger than x?"
With simple_partitioner, that would be correct. The lower limit would then be half the grainsize, but it's reached only if grainsize is even, from grainsize+1, by one of the parts, I think (please correct me if I should have given this a little more thought), unless the range already started out smaller than that of course. The default partitioner is now auto_partitioner, and it wisely tries to process larger sizes than that, with still the same ultimate limit on divisibility.

"Anyone out there with suggestions on how to achieve an aligned parallel for?"
Do you know for sure that this is important, with auto_partitioner and Amdahl involved?

bradleygibson · ‎06-14-2010

Hi, Raf,

Re: Partitioners:
You lost me a bit at the following phrase: "from grainsize + 1, by one of the parts..." Before that, my understanding was while grainsize is the upper bound for simple_partitioner the upper bound is larger for auto_partitioner. The lower bound is grainsize / 2 when grainsize is even. Did I miss anything?

Re: Alignment
I don't know for sure how important this might be, but at this stage, that is no longer the point. I'd simply like to understand if it's possible/how to do it, as it would improve my understading of the partitioner and of how grainsize is used.

I'd still love to hear from someone on a solution for this.

Thanks,
-Brad

RafSchietekat · ‎06-14-2010

"Did I miss anything?"
With auto_partitioner, the range is divided into at least something like a dozen times the default number of worker threads parts (if you can parse that sentence), with "a dozen times" used intentionally because it's not documented, so that would be the upper bound. The lower bound (if you start from a larger size) is always grainsize/2 of course, and if grainsize is uneven it is certainly exclusive, but otherwise it depends...

"I'd simply like to understand if it's possible/how to do it, as it would improve my understading of the partitioner and of how grainsize is used."
There's no explicit support for that, but the workaround has been indicated above. Up to you to decide whether you want to go that route, depending on the application. If you think it'll make a difference, I'd be interested to see the evidence, but my intuition says that it probably won't do much for you.

bradleygibson · ‎06-14-2010

Thank you, Raf! This all makes a lot more sense now.

-Brad