Intel® oneAPI Threading Building Blocks

Very bad scaling with many cores

Weber__Sebastian
Beginner

Hello everyone!

First of all, I would like to thank everyone for the wonderful TBB library; it looks very promising, and I am currently prototyping its use for the open-source project Stan, which is an MCMC program.

The key bottleneck of Stan is the calculation of the log-likelihood and its gradients with respect to the parameters. I have successfully implemented a toy example which iteratively evaluates the log-likelihood of a Poisson model. While the toy example performs nicely with these timings (units are ns):

1 cores: BM_tbbM_median        501958 ns     500734 ns       1208
2 cores: BM_tbbM_median        281973 ns     279780 ns       2413
4 cores: BM_tbbM_median        177745 ns     176584 ns       3890
6 cores: BM_tbbM_median        146703 ns     145824 ns       4433

I am getting terrible performance when running the same thing in the actual application:

cores=1
       37.30 real        37.11 user         0.07 sys

cores=2
       21.50 real        42.35 user         0.26 sys

cores=3
       17.23 real        50.69 user         0.44 sys

cores=4
       44.45 real       174.29 user         2.17 sys

cores=6
      241.85 real      1270.00 user        52.29 sys
 

So you see that with 2-3 cores things speed up, but then the execution times explode. The real example includes 4000 terms and I have set a grainsize of 100 (basically the same results with a grainsize of 1000).
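For illustration, the reduction is roughly of this shape (a simplified sketch with plain doubles and an illustrative Poisson log-likelihood; the real code linked below works on the autodiff types):

#include <cmath>
#include <functional>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// Simplified sketch: sum ~4000 Poisson log-likelihood terms with a
// grainsize of 100. Plain doubles are used here only to show the structure.
double poisson_log_lik(const std::vector<int>& y, double lambda) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, y.size(), 100),  // grainsize = 100
      0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
          partial += y[i] * std::log(lambda) - lambda - std::lgamma(y[i] + 1.0);
        return partial;
      },
      std::plus<double>());
}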

I am really lost at the moment as to why this happens. It looks to me as if the TBB scheduler goes completely off the rails due to the longer breaks in between the evaluations. So I was wondering whether putting threads to sleep can be avoided, but I am really guessing in the dark here.
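For reference, a minimal sketch of how the number of worker threads can be capped per run (this only illustrates tbb::global_control; whether it influences how idle worker threads are put to sleep, I do not know):

#include <tbb/global_control.h>

int main() {
  // Cap the TBB scheduler at 4 worker threads; the limit stays active for
  // the lifetime of the global_control object.
  tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 4);

  // ... run the parallel_reduce benchmark here ...
  return 0;
}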

The toy example is here: https://github.com/wds15/perf-math/blob/tbb/tbb-scale.cpp#L124

The TBB parallel_reduce is here: https://github.com/stan-dev/math/blob/1b6abbfc389cb8bfd803b5bac759dbd196f41672/stan/math/rev/scal/functor/parallel_reduce_sum.hpp#L107

The actual application code pieces would be here: https://discourse.mc-stan.org/t/proposed-parallelism-rfc-stan-language-bits/9477 (but that's maybe not too helpful)

I would very much appreciate any hints on how to debug this. If more information is needed, please let me know.

Many thanks in advance.

Sebastian

Weber__Sebastian
Beginner

Though it has been a while, I wanted to come back to this. It turned out that I had apparently used the Intel TBB API incorrectly, which is what caused the behavior described in my message above.

By now the parallel_reduce of the TBB has been successfully integrated into the Stan open-source software. Stan itself is an MCMC sampler tailored to solving Bayesian problems, i.e. obtaining posterior samples for a statistical model given observed data.

What made integrating the Intel TBB with Stan really special was the requirement to make the TBB interact with the automatic differentiation library used to calculate the gradient of the log-likelihood function evaluated in Stan models (the automatic differentiation works for arbitrary other programs as well). The complication stemmed from the fact that the autodiff library relies on a thread-local autodiff tape, so making things thread-safe in the context of the task-based Intel TBB wasn't straightforward.
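To give a rough idea of the pattern (a simplified sketch with a hypothetical Tape type, not the actual Stan Math implementation): each TBB worker thread keeps its own tape, and the reduction body records onto the tape of whichever thread happens to execute it:

#include <functional>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// Hypothetical stand-in for the autodiff tape; the real library keeps a
// much richer thread-local stack of recorded operations.
struct Tape {
  std::vector<double> records;
};

// One tape per thread: tasks may record on the tape of the thread they run
// on without any locking.
thread_local Tape local_tape;

double reduce_terms(const std::vector<double>& terms) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, terms.size()),
      0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
          local_tape.records.push_back(terms[i]);  // record on this thread's tape
          partial += terms[i];
        }
        return partial;
      },
      std::plus<double>());
}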

Anyway, the Intel TBB made it really easy to leverage multiple cores, and the speedups we are seeing are really good.

The facility using parallel_reduce is wrapped into the Stan function called "reduce_sum", which is currently available in a release candidate of Stan: https://discourse.mc-stan.org/t/cmdstan-2-23-release-candidate-is-available/14301

and here is a small document introducing its use: https://github.com/stan-dev/docs/blob/344d3d23cba77bf7178fb3b6241ac5f3dc2321af/src/stan-users-guide/parallelization.Rmd

So thanks for all the work on the TBB!
