Intel® oneAPI Threading Building Blocks

Troubleshooting performance issues in TBB


Is there any documentation available to help with troubleshooting performance issues when using TBB? (I haven't yet got the O'Reilly book, so that might be the short answer.)

I'm trying to use parallel_for and auto_partitioner to parallelise some existing calculation code under Windows XP. The modified code runs fine and Task Manager shows that all cores are being used throughout the calculation, but the runtime is slightly longer than the single-threaded version (the PC has two logical cores: one physical core plus Hyper-Threading).

Running under a profiler and looking at the wait_for_all function, roughly half the time is being spent inside my body function but half is "inside" the Windows Sleep function, which is called about 200,000 times compared to 93 calls to my body function.

The range is not completely uniform: it has 252 elements (taking just over a second) of which one block of 145 elements take 95% of the time.

Changing allocator, grainsize, partitioner and debug/release build doesn't seem to affect these relative times much, and DO_TBB_ASSERT doesn't flag any problems. Even running on a four-core PC gives no speed-up. I've generated a TBB_TRACE file but I'm not sure how to interpret the output.

Any pointers welcome. I suspect the answer is (a) read the book and (b) build up / reduce to a simple example, but it's frustrating being so close and so I hoped someone might recognise the symptoms and be able to tell me what I've done wrong.




From what you wrote, I would say there is serious load imbalance in the loop. "Half the time inside the Windows Sleep function" actually means the workers did not succeed in finding work to steal (Sleep(0) is called to yield CPU time if several attempts to steal work had no result). Might it be that in reality 90+% of the time is spent processing 1-5 elements and is not evenly distributed across the 145?
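To make the imbalance argument concrete, here is a rough sketch (not from the original posts) of the best speedup you could hope for once per-iteration times are known. It assumes each iteration is an indivisible task, as with a grainsize of 1, and ignores scheduling overhead. The function name speedup_bound and the example numbers are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Upper bound on speedup when every iteration is an indivisible task.
// The parallel run cannot finish sooner than the longest single iteration,
// nor sooner than total_time / workers, so the bound is total divided by
// the larger of those two values. Real runs will be slower than this.
double speedup_bound(const std::vector<double>& iteration_seconds, int workers) {
    assert(workers > 0 && !iteration_seconds.empty());
    const double total = std::accumulate(iteration_seconds.begin(),
                                         iteration_seconds.end(), 0.0);
    const double longest = *std::max_element(iteration_seconds.begin(),
                                             iteration_seconds.end());
    const double best_parallel = std::max(total / workers, longest);
    if (best_parallel <= 0.0) {
        return 1.0;  // nothing measurable to speed up
    }
    return total / best_parallel;
}
```

For example, 252 equally expensive iterations on 2 workers give a bound of 2, but if a single iteration accounts for 90% of the total time the bound drops to about 1.11 no matter how many workers are available.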

There are two things I would do:

  • Use the technique described by Kevin in this blog post, to better understand workload characteristics
  • Do not use auto_partitioner; use simple_partitioner and try a grainsize of 1. According to your data (a relatively small number of iterations and a relatively large time), the ratio of computation to iteration overhead should be big enough to justify this grainsize, and it gives the best possible balancing. You might try increasing the grainsize later if required.

Using these two options together, you can get the exact time spent in each iteration. Given the number of iterations, that is feasible to collect, and you can then analyze the load balance.
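As an illustration of the per-iteration timing idea (a minimal sketch, not the code from the blog post mentioned above), the helpers below record how long each index takes and then report what fraction of the total the slowest k iterations consume. The names timed_iteration and top_k_share are hypothetical. The sketch uses only the standard library; the intent is that timed_iteration would be called from inside the parallel_for body.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Runs body(i) and stores its wall-clock duration in seconds[i].
// Each index writes only its own slot, so calls with distinct i can run
// concurrently as long as `seconds` was sized before the loop started.
// Intended use (assumption): call this from the body passed to
//   tbb::parallel_for(tbb::blocked_range<size_t>(0, n, 1), ...,
//                     tbb::simple_partitioner());
template <typename Body>
void timed_iteration(std::size_t i, std::vector<double>& seconds, const Body& body) {
    assert(i < seconds.size());
    const auto start = std::chrono::steady_clock::now();
    body(i);
    const auto stop = std::chrono::steady_clock::now();
    seconds[i] = std::chrono::duration<double>(stop - start).count();
}

// Fraction of the total time consumed by the k slowest iterations.
// Takes the vector by value because it reorders its copy.
double top_k_share(std::vector<double> seconds, std::size_t k) {
    k = std::min(k, seconds.size());
    const double total = std::accumulate(seconds.begin(), seconds.end(), 0.0);
    if (total <= 0.0) {
        return 0.0;
    }
    const auto middle = seconds.begin() + static_cast<std::ptrdiff_t>(k);
    std::partial_sort(seconds.begin(), middle, seconds.end(), std::greater<double>());
    return std::accumulate(seconds.begin(), middle, 0.0) / total;
}
```

If top_k_share(times, 5) comes back close to 0.9, then a handful of elements dominate the run and no partitioner setting will help much; the expensive elements themselves would need to be split into smaller tasks.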

I think your problem is that you don't really have any CPU to divide and conquer. Having a single core with Hyper-Threading is not the same as having 2 cores. In fact, I would have predicted that your performance would be worse than just using a single thread. You are essentially telling the system to divide and conquer the work, but you really only have one compute resource to execute on. Thus, all you have done is add work to what you are trying to do, i.e. instead of just computing, you are computing and also dividing up the work.

Get a new computer with 2 or 4 cores and you will see some real performance gains.

Thanks, Alexey and JMB. The 'timestamp output + grainsize of 1' plan sounds good (and in hindsight embarrassingly obvious), so I'll try that. I'd already experimented with simple_partitioner and grainsize 8 on a proper dual-core machine with no improvement, so the list must be very badly imbalanced in some subtle way.
You might also take a look at my blog entries on analyzing TBB pipeline performance using Intel Thread Profiler and ITT event notifications, which is a great way to look at load balance issues, although I suspect JMB may have hit the right note. Hyper-Threading is a win when you have a mix of operations and can more fully utilize the functional units in the CPU by splitting them between a pair of threads, but if you're just pounding on specific units (like doing lots of floating point) you can saturate the out-of-order engine with just a single hardware thread. Adding a second thread just adds more overhead. But first I would recommend you try Alexey's suggestion regarding grain size.