Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

My report about a TBB application

Bartlomiej
New Contributor I
665 Views
In your spare time ;-) you might have a look at the internal report of my institute that I published recently.
There is nothing groundbreaking in it ;-), but it might still be (I hope) interesting for TBB developers to see a "case study" of how the library performs on a specific problem.

report-tbb.ps.gz

Any comments or remarks are welcome - in particular about the peculiar way the speedups depend on the number of threads.

Enjoy! ;-)
0 Kudos
15 Replies
Alexey-Kukanov
Employee
665 Views
For those of us poor Windows users who do not have a PS viewer at hand, would you mind attaching a PDF version of the paper? :)
0 Kudos
Bartlomiej
New Contributor I
665 Views
Oops, sorry - here it goes: report-tbb.pdf

PS. The research is still ongoing; if you're interested, I can keep you informed about further results.

0 Kudos
mwhenson
Beginner
665 Views
Quoting - bartlomiej
Oops, sorry - here it goes: report-tbb.pdf

PS. The research is still ongoing; if you're interested, I can keep you informed about further results.


It's very strange that you don't see any scalability beyond 7 threads. Do you use scalable_allocator?

Mike
0 Kudos
Bartlomiej
New Contributor I
665 Views
Quoting - mwhenson

It's very strange that you don't see any scalability beyond 7 threads. Do you use scalable_allocator?

Mike

Why, yes, obviously! Isn't it clear enough - not only from the text, but also from the tables' captions?

As a matter of fact, it seems to me that there might be some problem with memory allocation anyway. The scalable_allocator does use a mutex sometimes - in the function getPublicFreeListBlock(). But it should happen very infrequently, shouldn't it?
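
(To be concrete, "using the scalable allocator" here means the two usual routes - as a container allocator and via the C-style calls; the snippet below is a generic illustration, not the code from the report:)

#include <vector>
#include "tbb/scalable_allocator.h"

// 1) As the allocator argument of an STL container:
typedef std::vector<double, tbb::scalable_allocator<double> > Vec;

void allocator_usage_sketch() {
    Vec v;                         // all of v's memory comes from the scalable allocator
    v.resize(1024, 0.0);

    // 2) Through the C-style interface for raw buffers:
    void* buf = scalable_malloc(4096);
    // ... use buf ...
    scalable_free(buf);
}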

0 Kudos
Alexey-Kukanov
Employee
665 Views
Quoting - bartlomiej
As a matter of fact, it seems to me that there might be some problem with memory allocation anyway. The scalable_allocator does use a mutex sometimes - in the function getPublicFreeListBlock(). But it should happen very infrequently, shouldn't it?


Yes, there are a few places in the allocator where locks are used, but the locks are fine-grained (i.e. no long waits except under high contention), mostly distributed (i.e. contention is rarely an issue), and on cold paths. For the mentioned lock in getPublicFreeListBlock, all these properties hold. I am not saying that the TBB allocator can never be a scalability bottleneck, but generally I would not expect it to be.

Assuming that, no matter how many threads there are - 6, 7, or 8 - all of them are busy, the same wall-clock time for 7 and 8 threads means roughly a 15% increase in total clockticks for 8 threads (8/7 ≈ 1.14). This amount should probably be visible in a profiling tool such as VTune or PTU, so running the app under the profiler would probably be the next thing I would do to understand the scalability issue.
0 Kudos
Steve_Nuchia
New Contributor I
665 Views

Usually when scaling runs out of steam like that it's because you've maxed out some other resource, not because of a subtle bug. Example:

Say the algorithm is such that each thread uses one seventh of the available aggregate RAM bandwidth of the system. Then, with seven threads, you are just saturating the bandwidth, each thread stays busy, the caches stay hot, and all is well.

With eight threads, though, all eight start to experience stalls as they contend for the available bus cycles. Stalled threads lead to cold caches and overall performance starts to decay.

That's just an example. Actually the microarchitectural stalls in the RAM bandwidth example would not lead to the threads being rescheduled so that's probably not it. Page table contention would. Anyway, you get the idea.

When the problem fits in L1 cache for a long time it will scale well. If not, you need to understand where in the memory hierarchy it will actually bottleneck and design the parallelism around that.
0 Kudos
Anton_M_Intel
Employee
665 Views

Bartlomiej, thank you for the paper. I have a comment: you could use the new features of TBB 2.2 for "Thread-specific lists of solutions": enumerable_thread_specific or combinable.
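
Roughly, the per-thread lists could look like the sketch below (an illustration only - Solution and is_solution are placeholders, not names from the paper):

#include <list>
#include "tbb/enumerable_thread_specific.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

struct Solution { int id; };
bool is_solution(int i) { return i % 1000 == 0; }   // placeholder predicate

typedef tbb::enumerable_thread_specific< std::list<Solution> > SolutionLists;

struct CollectSolutions {
    SolutionLists& lists;
    CollectSolutions(SolutionLists& l) : lists(l) {}
    void operator()(const tbb::blocked_range<int>& r) const {
        std::list<Solution>& my_list = lists.local();   // this thread's private list, no locking
        for (int i = r.begin(); i != r.end(); ++i)
            if (is_solution(i)) {
                Solution s = { i };
                my_list.push_back(s);
            }
    }
};

std::list<Solution> find_solutions(int n) {
    SolutionLists per_thread;
    tbb::parallel_for(tbb::blocked_range<int>(0, n), CollectSolutions(per_thread));

    // Merge the per-thread lists sequentially once the parallel phase is done.
    std::list<Solution> all;
    for (SolutionLists::iterator it = per_thread.begin(); it != per_thread.end(); ++it)
        all.splice(all.end(), *it);
    return all;
}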

0 Kudos
robert_jay_gould
Beginner
665 Views
Quoting - Steve Nuchia

Usually when scaling runs out of steam like that it's because you've maxed out some other resource, not because of a subtle bug. Example:

Say the algorithm is such that each thread uses one seventh of the available aggregate RAM bandwidth of the system. Then, with seven threads, you are just saturating the bandwidth, each thread stays busy, the caches stay hot, and all is well.

With eight threads, though, all eight start to experience stalls as they contend for the available bus cycles. Stalled threads lead to cold caches and overall performance starts to decay.

That's just an example. Actually the microarchitectural stalls in the RAM bandwidth example would not lead to the threads being rescheduled so that's probably not it. Page table contention would. Anyway, you get the idea.

When the problem fits in L1 cache for a long time it will scale well. If not, you need to understand where in the memory hierarchy it will actually bottleneck and design the parallelism around that.

Steve, this is a very good point. I've indeed run into this sort of hardware bottleneck every now and then: even when the theoretical usage should be much more scalable, the hardware can get in the way, especially when developing complex (business/game) applications rather than threading benchmarks. Nevertheless, the application in the paper is more of a benchmark IMHO, so I'm guessing there must be a coding error (well, a bottleneck) somewhere. On a 16-core machine I would expect the architecture to be good enough to scale beyond 8x, unless the operation is I/O bound (or is caching memory to disk).
0 Kudos
Bartlomiej
New Contributor I
665 Views
Thanks for all the replies and suggestions - I've been out of town for two weeks.

Quoting - Anton Malakhov (Intel)

I have a comment: you could use the new features of TBB 2.2 for "Thread-specific lists of solutions":

Yes, but could you please give me an example of using the enumerable_thread_specific class? I found no examples anywhere and even complained about it in another thread.

I'm trying to do the profiling of the program now and going to test a few more variants next week.
Certainly, the problem is not with the memory overflow/caching, but there are some other possibilities...

0 Kudos
Terry_W_Intel
Employee
665 Views
Quoting - bartlomiej
Yes, but could you please give me an example of using the enumerable_thread_specific class? I found no examples anywhere and even complained about it in another thread.

I'm trying to do the profiling of the program now and going to test a few more variants next week.
Certainly, the problem is not with the memory overflow/caching, but there are some other possibilities...

There is an example in the Reference Manual, available from http://www.threadingbuildingblocks.org/documentation.php. Does that suffice?

Cheers,

Terry

0 Kudos
Matthieu_Brucher
Beginner
665 Views
There is an example in the Reference Manual, available from http://www.threadingbuildingblocks.org/documentation.php. Does that suffice?

Cheers,

Terry

Hi,

I've tried to find this sample in the OS version of the reference manual, but there is no such thing :| I'm also looking for examples of those new classes :)
0 Kudos
robert-reed
Valued Contributor II
665 Views
I've tried to find this sample in the OS version of the reference manual, but there is no such thing :| I'm also looking for examples of those new classes :)

Perhaps you could try again? I just went to http://www.threadingbuildingblocks.org/documentation.php and found the aforementioned enumerable_thread_specific coding example on pages 98 and 99 of the reference manual under the Open Source documentation header. Not sure why you didn't find it.

0 Kudos
Matthieu_Brucher
Beginner
665 Views

Perhaps you could try again? I just went to http://www.threadingbuildingblocks.org/documentation.php and found the aforementioned enumerable_thread_specific coding example on pages 98 and 99 of the reference manual under the Open Source documentation header. Not sure why you didn't find it.

OK, I found it, sorry (but there is no such example for combinable).
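
What I have in mind for combinable is roughly the following (my own sketch, not taken from the manual; the names are just illustrative):

#include <functional>
#include "tbb/combinable.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

long zero() { return 0L; }                       // initial value for each thread's copy

struct SumBody {
    tbb::combinable<long>& partial;
    const int* data;
    SumBody(tbb::combinable<long>& p, const int* d) : partial(p), data(d) {}
    void operator()(const tbb::blocked_range<int>& r) const {
        long& my_sum = partial.local();          // this thread's private accumulator
        for (int i = r.begin(); i != r.end(); ++i)
            my_sum += data[i];
    }
};

long parallel_sum(const int* data, int n) {
    tbb::combinable<long> partial(zero);
    tbb::parallel_for(tbb::blocked_range<int>(0, n), SumBody(partial, data));
    return partial.combine(std::plus<long>());   // reduce the per-thread copies
}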

Is it possible to use fixed-size arrays for thread specific storage?
0 Kudos
Bartlomiej
New Contributor I
665 Views
There is an example in the Reference Manual, available from http://www.threadingbuildingblocks.org/documentation.php. Does that suffice?

Dear Terry,
Thanks a lot! Yes, this example should do for me.

Dear Robert,
... but it wasn't there in the Reference Manual I downloaded on the 4th of September. No, really, I can prove it. ;-)
It's good TBB is spreading so quickly.

Regards

0 Kudos
Bartlomiej
New Contributor I
665 Views
Quoting - Bartlomiej
I'm trying to do the profiling of the program now and going to test a few more variants next week.
Certainly, the problem is not with the memory overflow/caching, but there are some other possibilities...


Aaargh...
The epilogue is really annoying.
When I had finished my (possibly quite clever) optimizations to the code, it turned out that the system administrator had upgraded Fedora from 10 to 11, which included changing GCC from version 4.3.2 to 4.4.1.
And now the results are as follows:
(i) the program scales quite well on all 16 cores (that's the nice part, but...),
(ii) there is no significant difference between the carefully tuned variant using the scalable allocator, concurrent vector, etc. and the "raw" variant,
(iii) there is also no noticeable difference between TBB and the variants using OpenMP and Pthreads, which are far simpler and keep only one common linked list, guarded by an ordinary mutex (sketched below).
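
(For concreteness, the shared-list variant in (iii) is essentially this kind of code - a sketch, not the actual source:)

#include <list>
#include <pthread.h>

struct Solution { int id; };
bool is_solution(int i) { return i % 1000 == 0; }   // placeholder predicate

std::list<Solution> results;                        // one list shared by all threads
pthread_mutex_t results_lock = PTHREAD_MUTEX_INITIALIZER;

void process_range(int begin, int end) {            // executed by each worker thread
    for (int i = begin; i != end; ++i)
        if (is_solution(i)) {
            Solution s = { i };
            pthread_mutex_lock(&results_lock);       // short critical section per solution found
            results.push_back(s);
            pthread_mutex_unlock(&results_lock);
        }
}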

Darn, stupid compilers!

Anyway, do any of you have an idea what changed the behavior of my programs so much?
*) Yes, they upgraded OpenMP from 2.5 to 3.0, but what affected the behavior of TBB?
*) Was it the compiler, or possibly the upgrade of the Linux kernel from 2.6.27 to 2.6.29 (specifically, 2.6.29.6-217.2.8.fc11.x86_64)?
*) Why does the default allocator work so well now that using the TBB scalable allocator gives no improvement? I could find nothing about memory allocation in the changelog: http://gcc.gnu.org/gcc-4.4/changes.html
Any ideas? Thanks in advance!

Best regards
0 Kudos
Reply