There is nothing groundbreaking in it ;-), but it might still be (I hope) interesting for TBB developers to see a "case study" of how the library performs on a specific problem.
report-tbb.ps.gz
Any comments or remarks are welcome - in particular about the peculiar way the speedups align with the number of threads.
Enjoy! ;-)
PS. The research is still ongoing; if you're interested, I can keep you informed about further results.
Why, yes, obviously! Isn't it clear enough - not only from the text, but also from the tables' captions?
As a matter of fact, it seems to me that there might be some problem with memory allocation anyway. The scalable_allocator uses a mutex sometimes - in the function getPublicFreeListBlock(). But that should happen very infrequently, shouldn't it?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes, there are a few places in the allocator where locks are used, but the locks are fine-grained (i.e., no long waits except under high contention), mostly distributed (i.e., contention is rarely an issue), and on cold paths. All these properties hold for the lock in getPublicFreeListBlock mentioned above. I am not saying the TBB allocator can never be a scalability bottleneck, but generally I would not expect it to be.
Assuming that all threads are busy no matter how many there are - 6, 7, or 8 - the same wall-clock time for 7 and 8 threads means about a 15% increase in total clock ticks for 8 threads. That amount should be visible in a profiling tool such as VTune or PTU, so running the app under a profiler would probably be the next thing I would do to understand the scalability issue.
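To spell out the arithmetic behind that estimate: with all threads busy for the same wall-clock time T, the total busy time grows from 7T to 8T thread-seconds, i.e. by (8T - 7T) / 7T = 1/7 ≈ 14.3%, roughly 15%.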
Usually when scaling runs out of steam like that it's because you've maxed out some other resource, not because of a subtle bug. Example:
Say the algorithm is such that each thread uses one seventh of the available aggregate RAM bandwidth of the system. Then, with seven threads, you are just saturating the bandwidth, each thread stays busy, the caches stay hot, and all is well.
With eight threads, though, all eight start to experience stalls as they contend for the available bus cycles. Stalled threads lead to cold caches and overall performance starts to decay.
That's just an example. Actually, the microarchitectural stalls in the RAM bandwidth example would not lead to the threads being rescheduled, so that's probably not it; page table contention would. Anyway, you get the idea.
When the problem fits in L1 cache for a long time it will scale well. If not, you need to understand where in the memory hierarchy it will actually bottleneck and design the parallelism around that.
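As a concrete illustration (a sketch only, not the code under discussion): a streaming kernel like the one below does one load and one store per element with almost no arithmetic, so its throughput is set by RAM bandwidth; past the thread count that saturates the bus, extra threads only add contention.

    #include <cstddef>
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    // Bandwidth-bound body: each element is touched exactly once, so the
    // memory bus, not the ALUs, sets the ceiling on throughput.
    struct Scale {
        double* a;
        double factor;
        void operator()(const tbb::blocked_range<std::size_t>& r) const {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                a[i] *= factor;  // one load + one store, trivial compute
        }
    };

    void scale_array(double* a, std::size_t n, double factor) {
        Scale body = { a, factor };
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n), body);
    }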
Bartlomiej, thank you for the paper. I have a comment: you could use the new features of TBB 2.2 for "Thread-specific lists of solutions": enumerable_thread_specific or combinable.
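For example, a minimal sketch of what that could look like (Solution is a hypothetical placeholder for the actual result type in the paper):

    #include <vector>
    #include "tbb/enumerable_thread_specific.h"

    struct Solution { /* ... */ };

    // One private vector per thread; local() returns this thread's copy
    // without blocking on other threads.
    typedef tbb::enumerable_thread_specific<std::vector<Solution> > SolutionLists;
    SolutionLists solution_lists;

    void record_solution(const Solution& s) {
        solution_lists.local().push_back(s);  // no mutex needed
    }

    // After the parallel phase, merge the per-thread lists sequentially.
    void gather(std::vector<Solution>& all) {
        for (SolutionLists::const_iterator i = solution_lists.begin();
             i != solution_lists.end(); ++i)
            all.insert(all.end(), i->begin(), i->end());
    }

Each thread appends through local(), so no locking is needed until the final sequential merge.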
Quoting - Anton Malakhov (Intel)
I have a comment: you could use the new features of TBB 2.2 for "Thread-specific lists of solutions":
I'm profiling the program now and am going to test a few more variants next week.
Certainly, the problem is not with memory overflow/caching, but there are some other possibilities...
Cheers,
Terry
I've tried to find this sample in the open-source version of the Reference Manual, but there is no such thing :| I'm also looking for examples of those new classes :)
Perhaps you could try again? I just went to http://www.threadingbuildingblocks.org/documentation.php and found the aforementioned enumerable_thread_specific coding example on pages 98 and 99 of the reference manual under the Open Source documentation header. Not sure why you didn't find it.
Is it possible to use fixed-size arrays for thread specific storage?
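Something like this is what I have in mind (a sketch; it assumes enumerable_thread_specific only requires the element type to be default- and copy-constructible, which a struct wrapping a fixed-size array is):

    #include "tbb/enumerable_thread_specific.h"

    // Hypothetical wrapper: a plain struct around a fixed-size array makes
    // the type copyable, so it can serve as the thread-specific element.
    struct FixedBuffer {
        double data[64];  // 64 is an arbitrary example size
    };

    tbb::enumerable_thread_specific<FixedBuffer> per_thread_buffer;

    void use_buffer() {
        FixedBuffer& buf = per_thread_buffer.local();  // this thread's copy
        buf.data[0] = 1.0;
    }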
Dear Terry,
Thanks a lot! Yes, this example should do for me.
Dear Robert,
... but it wasn't there in the Reference Manual I downloaded on the 4th of September. No, really, I can prove it. ;-)
It's good that TBB is evolving so quickly.
Regards
Aaargh...
The epilogue is really annoying.
While I was working on my (possibly quite clever) optimizations to the code, it turned out that the system administrator had upgraded Fedora from 10 to 11, which included changing GCC from version 4.3.2 to 4.4.1.
And now the results are as follows:
(i) the program scales quite well on all 16 cores (that's the nice part), but
(ii) there is no significant difference between the carefully tuned variant using the scalable allocator, concurrent vector, etc. and the "raw" variant, and
(iii) there is also no noticeable difference between TBB and the variants using OpenMP and Pthreads, which are far simpler and have only one common linked list guarded by an ordinary mutex (see the sketch below).
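For reference, the "raw" variant's storage amounts to something like this sketch (Solution again a placeholder):

    #include <list>
    #include <pthread.h>

    struct Solution { /* ... */ };

    // One shared list for all threads, guarded by an ordinary mutex.
    std::list<Solution> solutions;
    pthread_mutex_t solutions_lock = PTHREAD_MUTEX_INITIALIZER;

    void store_solution(const Solution& s) {
        pthread_mutex_lock(&solutions_lock);   // short critical section
        solutions.push_back(s);
        pthread_mutex_unlock(&solutions_lock);
    }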
Darn, stupid compilers!
Anyway, does any of you have an idea what changed the behavior of my programs so much?
*) Yes, they upgraded OpenMP from 2.5 to 3.0, but what would have affected the behavior of TBB?
*) Was it the compiler, or possibly the upgrade of the Linux kernel from 2.6.27 to 2.6.29 (specifically, 2.6.29.6-217.2.8.fc11.x86_64)?
*) Why does the default allocator work so well now that using the TBB scalable allocator gives no improvement? I could find nothing about memory allocation in the changelog (http://gcc.gnu.org/gcc-4.4/changes.html).
Any ideas? Thanks in advance!
Best regards
