topic Very bad scaling with many cores in Intel® oneAPI Threading Building Blocks & Intel® Threading Building Blocks
https://community.intel.com/t5/Intel-oneAPI-Threading-Building/Very-bad-scaling-with-many-cores/m-p/1177038#M14590
<P>Hello everyone!</P>
<P>First of all I would like to thank everyone for the wonderful TBB library; it looks very promising and I am currently prototyping its use in the open-source project Stan, which is an MCMC program.</P>
<P>The key bottleneck of Stan is the calculation of the log-likelihood and its gradients with respect to the parameters. I have successfully implemented a toy example which iteratively evaluates the log-likelihood of a Poisson example. The toy example performs nicely, with these timings (units are ns):</P>
<P>1 core: BM_tbbM_median 501958 ns 500734 ns 1208<BR />2 cores: BM_tbbM_median 281973 ns 279780 ns 2413<BR />4 cores: BM_tbbM_median 177745 ns 176584 ns 3890<BR />6 cores: BM_tbbM_median 146703 ns 145824 ns 4433</P>
<P>However, I am getting terrible performance when running the same thing in the actual application (times in seconds):</P>
<P>cores=1<BR /> 37.30 real 37.11 user 0.07 sys</P><P>cores=2<BR /> 21.50 real 42.35 user 0.26 sys</P><P>cores=3<BR /> 17.23 real 50.69 user 0.44 sys</P><P>cores=4<BR /> 44.45 real 174.29 user 2.17 sys</P><P>cores=6<BR /> 241.85 real 1270.00 user 52.29 sys</P>
<P>So you see that with 2-3 cores things speed up, but beyond that the execution times explode. The real example includes 4000 terms and I have set a grainsize of 100 (the results are basically the same with a grainsize of 1000).</P>
<P>I am really lost at the moment as to why this happens. It looks to me as if the TBB scheduler goes totally off the rails because of the longer breaks between the evaluations. So I was wondering whether putting threads to sleep can be avoided, but I am really guessing in the dark here.</P>
<P>The toy example is here: https://github.com/wds15/perf-math/blob/tbb/tbb-scale.cpp#L124</P>
<P>The TBB parallel_reduce is here: https://github.com/stan-dev/math/blob/1b6abbfc389cb8bfd803b5bac759dbd196f41672/stan/math/rev/scal/functor/parallel_reduce_sum.hpp#L107</P>
<P>The actual application code pieces would be here: https://discourse.mc-stan.org/t/proposed-parallelism-rfc-stan-language-bits/9477 (but that's maybe not too helpful)</P>
<P>I would very much appreciate any hints on how to debug this. In case more information is needed, please let me know.</P>
<P>Many thanks in advance.</P>
<P>Sebastian</P>
Weber__Sebastian, Thu, 04 Jul 2019 18:06:16 GMT
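To put the application timings quoted in the post into perspective, here is a small C++ sketch that turns them into speedup, parallel efficiency, and growth in total CPU time relative to the single-core run. The numbers are copied from the post; the struct and function names (`Timing`, `speedup`, `efficiency`, `cpu_inflation`, `print_scaling_report`) are made up for this illustration and are not part of TBB or Stan.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// One row of the `time` output quoted in the post (hypothetical helper type).
struct Timing {
    int cores;
    double real_s;  // wall-clock seconds
    double user_s;  // user CPU seconds, summed over all threads
};

// Timings copied from the post; the sys column is left out of this sketch.
inline std::vector<Timing> reported_timings() {
    return {{1, 37.30, 37.11},
            {2, 21.50, 42.35},
            {3, 17.23, 50.69},
            {4, 44.45, 174.29},
            {6, 241.85, 1270.00}};
}

// Speedup relative to the single-core wall-clock time.
inline double speedup(const Timing& base, const Timing& t) {
    return base.real_s / t.real_s;
}

// Parallel efficiency: speedup divided by the number of cores used.
inline double efficiency(const Timing& base, const Timing& t) {
    return speedup(base, t) / t.cores;
}

// Factor by which total user CPU time grew compared with the single-core run.
inline double cpu_inflation(const Timing& base, const Timing& t) {
    return t.user_s / base.user_s;
}

// Print one line per core count.
inline void print_scaling_report() {
    const std::vector<Timing> rows = reported_timings();
    const Timing& base = rows.front();
    std::printf("cores  speedup  efficiency  cpu_inflation\n");
    for (const Timing& t : rows) {
        std::printf("%5d  %7.2f  %10.2f  %13.2f\n", t.cores,
                    speedup(base, t), efficiency(base, t),
                    cpu_inflation(base, t));
    }
}
```

Calling `print_scaling_report()` from a `main` function prints the table. The speedup is roughly 1.73 on 2 cores and 2.16 on 3 cores, then falls below 1 on 4 cores and to about 0.15 on 6 cores, while total user CPU time grows to about 34 times the single-core value. That pattern is consistent with the threads spending most of their CPU time on something other than useful work, although the timings alone do not say what.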
https://community.intel.com/t5/Intel-oneAPI-Threading-Building/Very-bad-scaling-with-many-cores/m-p/1177039#M14591
<P>Though it's a while ago, I wanted to come back to this. It turned out that I had apparently been using the Intel TBB API incorrectly, which is what prompted my message above.</P>
<P>By now the TBB parallel reduce has been successfully integrated into the Stan open-source software. Stan itself is an MCMC sampler tailored to solving Bayesian problems, i.e. obtaining a posterior sample for some statistical model given observed data.</P>
<P>What made integrating the Intel TBB with Stan really special is the requirement that the TBB interact with the automatic differentiation library used to calculate the gradient of the log-likelihood function evaluated in Stan models. Obviously, the automatic differentiation works for any other program as well. The complication stemmed from the fact that the autodiff library relies on a thread-local autodiff tape, so making things safe in the context of the task-based Intel TBB wasn't straightforward.</P>
<P>Anyway, the Intel TBB made it really easy to leverage multiple cores and the speedups we are seeing are really good.</P>
<P>The facility using parallel_reduce is wrapped into the Stan function called "reduce_sum", currently in a release candidate of Stan: https://discourse.mc-stan.org/t/cmdstan-2-23-release-candidate-is-available/14301</P>
<P>and here is a small document introducing its use: https://github.com/stan-dev/docs/blob/344d3d23cba77bf7178fb3b6241ac5f3dc2321af/src/stan-users-guide/parallelization.Rmd</P>
<P>So thanks for all the work on the TBB!</P>
Weber__Sebastian, Wed, 15 Apr 2020 14:22:06 GMT
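For readers unfamiliar with how a grainsize shapes a reduction like the one discussed in this thread, the following is a minimal serial sketch of the split-and-join pattern. It does not use TBB and runs on a single thread. It only mimics the rule that a range keeps being halved while it holds more than `grainsize` items, which is how TBB's `blocked_range` decides whether it is divisible. The names `poisson_lpmf`, `ReduceResult` and `chunked_sum` are hypothetical and chosen for this example; they are not the Stan or TBB implementation, and the real scheduler's partitioner may produce fewer, larger chunks.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Poisson log-probability mass: y*log(lambda) - lambda - log(y!).
inline double poisson_lpmf(int y, double lambda) {
    return y * std::log(lambda) - lambda - std::lgamma(y + 1.0);
}

// Partial sum plus the number of leaf chunks that were evaluated.
struct ReduceResult {
    double sum;
    std::size_t chunks;
};

// Recursively halve [begin, end) while it holds more than `grainsize` items,
// evaluate each leaf chunk into its own local accumulator, then join the
// two halves by adding their partial sums.
inline ReduceResult chunked_sum(const std::vector<int>& y, double lambda,
                                std::size_t begin, std::size_t end,
                                std::size_t grainsize) {
    if (end - begin > grainsize) {
        const std::size_t mid = begin + (end - begin) / 2;
        const ReduceResult left = chunked_sum(y, lambda, begin, mid, grainsize);
        const ReduceResult right = chunked_sum(y, lambda, mid, end, grainsize);
        return {left.sum + right.sum, left.chunks + right.chunks};
    }
    double partial = 0.0;  // private to this chunk, nothing is shared
    for (std::size_t i = begin; i < end; ++i) {
        partial += poisson_lpmf(y[i], lambda);
    }
    return {partial, 1};
}
```

Under this halving rule, the 4000 terms mentioned in the original post end up in 64 chunks of 62 or 63 terms with a grainsize of 100, and in 4 chunks of 1000 terms with a grainsize of 1000. Because each chunk accumulates into its own local variable and results are only combined in the join step, no state is shared between chunks, which is the property a task-based reduction relies on.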