Is there any documentation available to help with troubleshooting performance issues when using TBB? (I haven't yet got the O'Reilly book, so that might be the short answer.)
I'm trying to use parallel_for and auto_partitioner to parallelise some existing calculation code under Windows XP. The modified code runs fine and Task Manager shows that all cores are being used throughout the calculation, but the runtime is slightly longer than the single-threaded version (the PC has two logical cores: one physical core plus hyperthreading).
Running under a profiler and looking at the wait_for_all function, roughly half the time is being spent inside my body function but half is "inside" the Windows Sleep function, which is being called about 200,000 times compared to 93 calls to my body function.
The range is not completely uniform: it has 252 elements (taking just over a second), of which one block of 145 elements takes 95% of the time.
Changing allocator, grainsize, partitioner and debug/release build doesn't seem to affect these relative times much, and DO_TBB_ASSERT doesn't flag any problems. Even running on a four-core PC gives no speed-up. I've generated a TBB_TRACE file but I'm not sure how to interpret the output.
Any pointers welcome. I suspect the answer is (a) read the book and (b) build up / reduce to a simple example, but it's frustrating being so close and so I hoped someone might recognise the symptoms and be able to tell me what I've done wrong.
Thanks
Bryan
From what you wrote, I would say there is serious load imbalance in the loop. "Half time inside the windows Sleep function" actually means the workers did not succeed in finding work to steal (Sleep(0) is called to yield CPU time if several attempts to steal some job had no result). Might it be that in reality 90+% of the time is spent processing 1-5 elements, and is not evenly distributed across the 145?
There are two things I would do:
- Use the technique described by Kevin in this blog post, to better understand workload characteristics
- Do not use auto_partitioner and try a grainsize of 1. According to your data (a relatively small number of iterations and a relatively large time), the ratio of computation to iteration overhead should be big enough to justify this grainsize, and it gives the best possible balancing. You might try to increase the grainsize later if required.
Using these two options together, you can get the exact time spent in each iteration. Given the number of iterations, this is feasible to obtain, and you can then analyze the load balancing.
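To make the "measure each iteration, then analyze" step concrete, here is a minimal sketch in plain standard C++ (no TBB calls). The names `time_each_iteration` and `speedup_bound` are hypothetical helpers made up for illustration, and the timing loop here is serial; with TBB you would record the same per-iteration timings from inside your own body function. The bound assumes each iteration is an indivisible unit of work, which is what a grainsize of 1 gives you.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: run body(i) for i in [0, n) serially and record
// the wall-clock time of each iteration, in seconds.
template <typename Body>
std::vector<double> time_each_iteration(std::size_t n, Body body) {
    std::vector<double> seconds(n);
    for (std::size_t i = 0; i < n; ++i) {
        const auto start = std::chrono::steady_clock::now();
        body(i);
        const auto stop = std::chrono::steady_clock::now();
        seconds[i] = std::chrono::duration<double>(stop - start).count();
    }
    return seconds;
}

// Hypothetical helper: upper bound on the speedup achievable with `workers`
// threads when no iteration can be split. The parallel run can be no
// shorter than the longest single iteration, nor shorter than an even
// share of the total work.
double speedup_bound(const std::vector<double>& cost, unsigned workers) {
    assert(workers > 0);
    if (cost.empty()) return 1.0;
    const double total = std::accumulate(cost.begin(), cost.end(), 0.0);
    if (total <= 0.0) return 1.0;
    const double longest = *std::max_element(cost.begin(), cost.end());
    const double best_parallel = std::max(longest, total / workers);
    return total / best_parallel;
}
```

If `speedup_bound` comes out close to 1 for your measured timings, a handful of iterations dominate the run and no partitioner setting can help; the heavy iterations themselves would need to be split up. If it comes out close to the number of workers, the imbalance is coming from how the range is being divided rather than from the work itself.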
Get a new computer with 2 or 4 cores and you will see some real performance gains.
/JMB