Intel® oneAPI Threading Building Blocks

parallel_reduce's join gives different results

Zhu_W_Intel
Employee

I used an example parallel_reduce program. What I found is that when I use:
tbb::task_scheduler_init init(4);
parallel_reduce's split and join functions are not called. My workload is not that big, so this might happen.
But if I switch to:
int n = task_scheduler_init::default_num_threads();
which is 6 when I print it out, all the splits and joins are called.

This causes a further problem, because if I have this in join:

void join( const SumFoo& y ) {
    std::cout << "join " << this << " " << &y << std::endl;
    my_sum += y.my_sum/2;
}



I get different results when join is called compared to when join is not called.
Why is that? Isn't this dangerous, since the logic can be wrong when the join function is skipped?
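
For context, here is roughly the full test I am running, modeled on the SumFoo example from the TBB Tutorial (the array a, its length n, and the values in it are placeholders):

#include <iostream>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

struct SumFoo {
    const float* my_a;
    float my_sum;

    SumFoo( const float* a ) : my_a(a), my_sum(0) {}
    SumFoo( SumFoo& x, tbb::split ) : my_a(x.my_a), my_sum(0) {
        std::cout << "split " << &x << std::endl;
    }
    void operator()( const tbb::blocked_range<size_t>& r ) {
        for( size_t i = r.begin(); i != r.end(); ++i )
            my_sum += my_a[i];
    }
    void join( const SumFoo& y ) {
        std::cout << "join " << this << " " << &y << std::endl;
        my_sum += y.my_sum/2;   // my experiment; the usual reduction is my_sum += y.my_sum
    }
};

int main() {
    tbb::task_scheduler_init init(4);   // vs. task_scheduler_init::default_num_threads()
    const size_t n = 1000;               // placeholder workload size
    float* a = new float[n];
    for( size_t i = 0; i < n; ++i ) a[i] = 1.0f;
    SumFoo sf(a);
    tbb::parallel_reduce( tbb::blocked_range<size_t>(0, n), sf );
    std::cout << "sum = " << sf.my_sum << std::endl;
    delete[] a;
    return 0;
}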
RafSchietekat
Valued Contributor III
"my_sum+=y.my_sum/2" is not a valid reduction because it is not associative. With an associative operation, you would not care very much about split/join (other than for performance and differences at the limit of precision).
jimdempseyatthecove
Honored Contributor III
Even when the operations are associative, the sequence in which the reductions are made can vary when using floats or doubles. When the data being "reduced" are approximations with rounded precision, the eventual result may vary in the lsb(s) depending on the sequence of reduction. parallel_reduce's join will produce equivalent results (within some epsilon).
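
A small illustration of the ordering effect with floats (plain C++, nothing TBB-specific):

#include <iostream>

int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    std::cout << (a + b) + c << std::endl;   // prints 1
    std::cout << a + (b + c) << std::endl;   // prints 0: c is rounded away when added to b
    return 0;
}

With doubles the discrepancy just shows up at smaller magnitudes; the effect is the same.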

Jim Dempsey
Zhu_W_Intel
Employee
Thanks. But putting my_sum += y.my_sum/2 there was just my experiment.
My question is: why, with

tbb::task_scheduler_init init(4);

are parallel_reduce's split and join functions not called, but if I switch to

int n = task_scheduler_init::default_num_threads();

split and join are called? The workload is the same, and I have repeated the test many times; it is always the case.

RafSchietekat
Valued Contributor III
Try a range that's long enough and you'll probably have split/join with 4 threads as well. If pressed I would guess that with 6 threads an auto_partitioner (the default) generates more chunks than with 4 threads, giving more parallel overhead and more opportunities for a thief, but I can't be certain that this is the explanation here. It also doesn't seem that important, if you can confirm that there's not a lot of work to begin with.

(Added 2012-02-18) And of course that's exactly what you did: "My work load is not that big, so this might happen."
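
If you do want to see splits even with a small workload, one option (a sketch, reusing the SumFoo body above; a and n are placeholders) is to give the range a small grainsize and pass a simple_partitioner, which keeps splitting down to the grainsize regardless of the number of threads:

SumFoo sf(a);
tbb::parallel_reduce( tbb::blocked_range<size_t>(0, n, /*grainsize=*/16),
                      sf,
                      tbb::simple_partitioner() );

Whether the chunks are actually stolen (and therefore joined) still depends on the scheduler, but with many chunks and idle workers you will normally see some joins.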
Vladimir_P_1234567890
hello,

did you use tbb::parallel_deterministic_reduce to get these results? The simple parallel_reduce is not deterministic.
You can find it in Appendix D.3 of the Reference.
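
A sketch of how it can be used with the same SumFoo body (a and n are placeholders; it is a Community Preview feature in that release):

SumFoo sf(a);
tbb::parallel_deterministic_reduce( tbb::blocked_range<size_t>(0, n, /*grainsize=*/16), sf );
std::cout << "sum = " << sf.my_sum << std::endl;

It partitions the range the same way for a given grainsize and combines the partial results in the same order, so the splits/joins and the floating-point result are reproducible from run to run.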

--Vladimir