parallel_reduce's join gives different results

Employee

I used an example parallel_reduce program. What I found is that when I use:

tbb::task_scheduler_init init(4);

parallel_reduce's split and join functions are not called. My workload is not that big, so this might happen.
But if I switch to:

int n = task_scheduler_init::default_num_threads();

which is 6 when I print it out, all the splits and joins are called.

This will cause a further problem, because if I have this in join:

void join( const SumFoo& y ) {
    // the print statement in the original post was garbled; it appears to
    // print the addresses of the two bodies being joined
    std::cout << "join " << this << " " << &y << std::endl;
    my_sum += y.my_sum/2;
}



I will get different results when join is called compared to when join is not called.
Why is that? Isn't this dangerous, since the logic can be wrong when the join function is skipped?
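
For reference, here is a minimal sketch of the kind of program the question describes, modeled on the SumFoo example from the TBB tutorial (the array contents, the range size, and the plain summation in operator() are assumptions made for illustration):

#include <iostream>
#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

struct SumFoo {
    const float* my_a;
    float my_sum;

    SumFoo( const float* a ) : my_a(a), my_sum(0) {}
    // splitting constructor: a thief copies the input but starts with an empty sum
    SumFoo( SumFoo& x, tbb::split ) : my_a(x.my_a), my_sum(0) {}

    void operator()( const tbb::blocked_range<size_t>& r ) {
        float sum = my_sum;
        for( size_t i = r.begin(); i != r.end(); ++i )
            sum += my_a[i];                   // stand-in for Foo(my_a[i])
        my_sum = sum;
    }

    void join( const SumFoo& y ) {
        std::cout << "join " << this << " " << &y << std::endl;
        my_sum += y.my_sum/2;                 // the experiment from the question
    }
};

int main() {
    float a[1000];
    for( int i = 0; i < 1000; ++i ) a[i] = 1.0f;

    tbb::task_scheduler_init init(4);         // vs. task_scheduler_init::default_num_threads()
    SumFoo sf(a);
    tbb::parallel_reduce( tbb::blocked_range<size_t>(0, 1000), sf );
    std::cout << "sum = " << sf.my_sum << std::endl;
    return 0;
}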
Black Belt

"my_sum+=y.my_sum/2" is not a valid reduction because it is not associative. With an associative operation, you would not care very much about split/join (other than for performance and differences at the limit of precision).

Even when the operations are associative, the sequence in which the reductions are made can vary when using floats or doubles. When the data being "reduced" are approximations with rounded precision, the eventual result may vary in the lsb(s) depending on the sequence of reduction. parallel_reduce's join will produce equivalent results (within some epsilon).
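
A tiny illustration of this (my own example, not from the thread): with float, merely changing the order in which the same three values are added changes the rounded result.

#include <cstdio>

int main() {
    float big = 1.0e8f, small = 3.0f;
    float left_to_right = ( big + small ) + small;   // the small terms are lost to rounding
    float small_first   = ( small + small ) + big;   // the small terms survive and round up
    std::printf( "%.1f\n%.1f\n", left_to_right, small_first );  // 100000000.0 vs 100000008.0
    return 0;
}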

Jim Dempsey
Employee

Thanks. But putting my_sum += y.my_sum/2 in join was just my experiment.
My question is: why with

tbb::task_scheduler_init init(4);

are parallel_reduce's split and join functions not called, but if I switch to

int n = task_scheduler_init::default_num_threads();

split and join are called? The workload is the same, and I tested it repeatedly many times and it is always the case.

Black Belt

Try a range that's long enough and you'll probably have split/join with 4 threads as well. If pressed I would guess that with 6 threads an auto_partitioner (the default) generates more chunks than with 4 threads, giving more parallel overhead and more opportunities for a thief, but I can't be certain that this is the explanation here. It also doesn't seem that important, if you can confirm that there's not a lot of work to begin with.

(Added 2012-02-18) And of course that's exactly what you did: "My work load is not that big, so this might happen."
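
If you want to see splitting even with a modest range, one option (my own sketch, not from the thread; it reuses the SumFoo body sketched earlier) is to give the range an explicit grain size and pass simple_partitioner, which keeps splitting until chunks reach that grain size:

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>

// With a grain size of 10, a range of 1000 is split into roughly 100 chunks,
// so idle threads get many opportunities to steal work (and therefore to join).
void run_with_forced_splits( SumFoo& sf, size_t n ) {
    tbb::parallel_reduce( tbb::blocked_range<size_t>(0, n, /*grainsize*/ 10),
                          sf,
                          tbb::simple_partitioner() );
}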

hello,

did you use tbb::parallel_deterministic_reduce to get these results? Simple reduce is not deterministic.
You can find it in Appendix D.3 in the reference.
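
A quick sketch of that variant (my own example; it reuses the SumFoo body sketched earlier, and the preview macro name is an assumption based on the TBB 4.0 reference). parallel_deterministic_reduce splits and joins the same way for a given range and grain size, so the rounded result is reproducible regardless of how many threads run:

#define TBB_PREVIEW_DETERMINISTIC_REDUCE 1   // may be required while the feature is a Community Preview
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

float deterministic_sum( const float* a, size_t n ) {
    SumFoo sf(a);
    // Unlike plain parallel_reduce, the splits here do not depend on load balancing,
    // so the reduction tree (and the result) is the same on every run.
    tbb::parallel_deterministic_reduce(
        tbb::blocked_range<size_t>(0, n, /*grainsize*/ 100), sf );
    return sf.my_sum;
}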

--Vladimir