Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

parallel_reduce's join gives different results

Zhu_W_Intel
Employee

I used an example parallel_reduce program. What I found is that when I use:
tbb::task_scheduler_init init(4);
parallel_reduce's split and join functions are not called. My workload is not that big, so this might happen.
But if I switch to:
int n = task_scheduler_init::default_num_threads();
which is 6 when I print it out, all the splits and joins are called.

This causes a further problem, because if I have this in join:

void join( const SumFoo& y ) {
    std::cout << "join " << this << " " << &y << std::endl;
    my_sum += y.my_sum/2;
}



I get different results when join is called compared to when join is not called.
Why is that? Isn't this dangerous, since the logic can be wrong when the join function is skipped?
RafSchietekat
Black Belt
"my_sum+=y.my_sum/2" is not a valid reduction because it is not associative. With an associative operation, you would not care very much about split/join (other than for performance and differences at the limit of precision).
jimdempseyatthecove
Black Belt
Even when the operations are associative, the order in which the reductions are made can vary when using floats or doubles. When the data being "reduced" are approximations with rounded precision, the eventual result may vary in the lsb(s) depending on the order of reduction. parallel_reduce's join will produce equivalent results (within some epsilon).

Jim Dempsey
Zhu_W_Intel
Employee
Thanks. But putting my_sum += y.my_sum/2 in was just an experiment.
My question is: why, with

tbb::task_scheduler_init init(4);

are parallel_reduce's split and join functions not called, but if I switch to

int n = task_scheduler_init::default_num_threads();

split and join are called? The workload is the same, and I have repeated the test many times; it is always the case.

RafSchietekat
Black Belt
Try a range that's long enough and you'll probably have split/join with 4 threads as well. If pressed I would guess that with 6 threads an auto_partitioner (the default) generates more chunks than with 4 threads, giving more parallel overhead and more opportunities for a thief, but I can't be certain that this is the explanation here. It also doesn't seem that important, if you can confirm that there's not a lot of work to begin with.

(Added 2012-02-18) And of course that's exactly what you did: "My work load is not that big, so this might happen."
Vladimir_P_Intel2
hello,

did you use tbb::parallel_deterministic_reduce to get these results? Simple reduce is not deterministic.
You can find it in Appendix D.3 in the Reference.

--Vladimir