I would be grateful for a pointer on how to debug the following:
I used TBB to parallelize a given routine. To verify correctness and speed I developed it in a standalone unit test. Once that was ready I integrated it into the original application, but now the performance gain is much smaller than in my test program. Since the task is to move data around in memory (~1 GB and more), I think I'm safe to assume that I'm not seeing any cache effects. Also, the program it is integrated into is not using multithreading at this point of execution, so there is no overcommitment.
So far my debugging turned up that in my test scenario the 24 TBB threads each process 23-72 tasks out of a total of 1024. In the application the distribution is 1-175 tasks per thread (11 threads processed just a single task).
Important note: I'm using parallel_for with the simple_partitioner, so all tasks are of comparable size.
Any idea what is going wrong in my application? Can the TBB threads take too long to wake up? Any idea how to improve the performance?
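For reference, a minimal sketch of the kind of loop in question (buffer layout, sizes and names are illustrative, not the actual code):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>
#include <vector>
#include <cstring>

// Moves num_blocks equally sized blocks from src to dst; with a grain size of
// one block and the simple_partitioner, every task carries the same fixed
// amount of work, so the 1024 tasks should spread roughly evenly over the
// 24 worker threads.
void copy_blocks(const std::vector<char>& src, std::vector<char>& dst,
                 std::size_t block_size)
{
    const std::size_t num_blocks = src.size() / block_size;
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, num_blocks, /*grainsize=*/1),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                std::memcpy(&dst[i * block_size], &src[i * block_size], block_size);
        },
        tbb::simple_partitioner());
}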
More information after more debugging:
The CPU time and the monotonic clock time are the same, i.e. only one core executes the 24 TBB threads. The remaining 23 cores are idle. :(
The best guess so far at what is happening: GotoBLAS, which is statically linked into the final executable, initializes its worker threads and leaves the main thread pinned to one core. After that the TBB initialization creates its threads and does not clear the CPU pinning of the calling thread, thus pinning all worker threads to the same core.
What puzzles me, though, is that I tried the following two things and they didn't help:
- I added a sched_setaffinity call right before calling tbb::task_scheduler_init to clear the CPU pinning, and reset it to the previous value after task_scheduler_init (a rough sketch follows the list below)
- I patched the binary to change the order of initialization (i.e. first tbb::task_scheduler_init, then gotoblas_init)
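Roughly what that first attempt looked like (a sketch with error handling omitted; as said above, it did not help):

// Assumes Linux/glibc; _GNU_SOURCE is needed for the CPU_* macros.
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <tbb/task_scheduler_init.h>

void init_tbb_with_full_affinity()
{
    // Remember the affinity GotoBLAS left on the calling (main) thread.
    cpu_set_t old_mask;
    CPU_ZERO(&old_mask);
    sched_getaffinity(0, sizeof(old_mask), &old_mask);

    // Temporarily allow the main thread to run on every core, hoping that the
    // workers created during task_scheduler_init inherit this mask.
    cpu_set_t full_mask;
    CPU_ZERO(&full_mask);
    const long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    for (long cpu = 0; cpu < ncpus; ++cpu)
        CPU_SET(cpu, &full_mask);
    sched_setaffinity(0, sizeof(full_mask), &full_mask);

    static tbb::task_scheduler_init init;  // static: keep the scheduler alive

    // Restore the previous pinning of the main thread.
    sched_setaffinity(0, sizeof(old_mask), &old_mask);
}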
Any ideas on how to make TBB actually use the other 23 cores?
I suggest that TBB become smart enough to handle a pinned main thread and remove the pinning for all spawned worker threads.
We are aware of the issue, and will address it in a future update. Meanwhile, this case just confirms that thread pinning is evil.
How can you say that?
Bad thread pinning is "evil".
The "fault" here is lack of documentation when using multiple multi-threaded libraries and how to construct work arounds.
Jim Dempsey
How can you say that?
Bad thread pinning is "evil".
Agreed. The OS, in theory, cannot manage threads better than the application, while the application in many cases can manage threads better than the OS.
And, of course, pinning all threads to one core is not a model example of good thread management. You should not draw any conclusions based on this example.
So ANY good application should be designed to be cooperative with regard to resource usage. Direct manipulation of thread-to-core affinity is not, and cannot be, cooperative. Ideally, an application should hint to the OS how to assign its threads, but should not force a particular pinning.
I don't agree with this part. There are many applications in HPC that are allowed to fully use the resources they were assigned as they see fit, i.e. a batch scheduler is used to schedule the processes.
In HPC, pinning threads to cores can make a lot of sense, as it can increase efficiency. Since GotoBLAS is geared towards HPC applications, it pins its worker threads (and the main thread, to keep it away from the workers' cores).
TBB and Cilk are designed for the opposite case, where a program is composed from multiple black boxes and the resource needs of each box are not exposed.
As Alexey said in an earlier note, we're working on making TBB behave more sensibly.
As a short-term workaround, task_scheduler_observer can be used to unpin the worker threads. Create one with an on_scheduler_entry method that unpins the current thread.
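A minimal sketch of such an observer, assuming Linux and sched_setaffinity (the class name is made up for illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <tbb/task_scheduler_observer.h>

// Resets the affinity of every thread that enters the TBB scheduler to the
// set of all online CPUs, undoing whatever pinning the thread inherited.
class unpin_observer : public tbb::task_scheduler_observer {
public:
    unpin_observer() { observe(true); }  // start observing immediately

    void on_scheduler_entry(bool /*is_worker*/) override {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        const long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        for (long cpu = 0; cpu < ncpus; ++cpu)
            CPU_SET(cpu, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);  // 0 == the calling thread
    }
};

Creating one such observer before the parallel work starts should be enough; each worker gets its affinity reset the first time it joins the scheduler.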
There are at least two things in the described affinity case that appear insane. First, it's insane for a new thread to inherit affinity from its creator thread (a sane behavior would be to inherit the affinity of the process). I am now aware of this bad behavior, and a workaround will be implemented in TBB.
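(To illustrate the first point, a small sketch: on Linux, a thread created with pthread_create starts with the affinity mask of the thread that created it, even if the rest of the process is allowed to run on every core.)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void* report_affinity(void*) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    pthread_getaffinity_np(pthread_self(), sizeof(mask), &mask);
    std::printf("child thread may run on %d CPU(s)\n", CPU_COUNT(&mask));
    return nullptr;
}

int main() {
    // Pin the creating (main) thread to CPU 0 only.
    cpu_set_t one_cpu;
    CPU_ZERO(&one_cpu);
    CPU_SET(0, &one_cpu);
    pthread_setaffinity_np(pthread_self(), sizeof(one_cpu), &one_cpu);

    // The new thread inherits that single-CPU mask from its creator.
    pthread_t t;
    pthread_create(&t, nullptr, report_affinity, nullptr);
    pthread_join(t, nullptr);  // prints "child thread may run on 1 CPU(s)"
    return 0;
}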
The second strange thing is for GotoBLAS to set the affinity of the main thread, as if the library fully controlled everything done there. If this is the default behavior, it's insane; if it cannot be changed, it's even worse than that. A better behaviour, if affinity is so desired, would be to apply it when entering a library function and revert it when leaving the function. Now Matthias has become aware of that, and has to implement some workarounds. The best thing IMO would be to prevent GotoBLAS from changing the main thread's affinity at all, if possible.
Note that if either of these two behaviors were sane, Matthias wouldn't have the problem, neither with TBB nor with any other threading library.
Finally, when I said "know how to do things right" I meant "do affinity right". And that's far from being true for the average programmer, simply because this is low-level system stuff that most people aren't experts in. The times when programmers knew everything about their computers are long gone; so are the times when programs were developed for a single computer configuration (and rewritten each time the configuration changed).
I don't agree with this part. There are many applications in HPC that are allowed to fully use the resources they were assigned as they see fit, i.e. a batch scheduler is used to schedule the processes.
In HPC, pinning threads to cores can make a lot of sense, as it can increase efficiency. Since GotoBLAS is geared towards HPC applications, it pins its worker threads (and the main thread, to keep it away from the workers' cores).
Even in HPC, it sounds plain wrong for a library to assume that it always fully controls the machine. What if the application wants to apply parallelism at a higher level than the BLAS routines (which may well be more beneficial) and so limits BLAS-level parallelism? A library forcing an application to use a single core is much like the tail wagging the dog.
Keeping the main thread away from the worker cores is something that an OS can do (and I believe modern OSes do) very easily. If the worker threads leave at least one core free (which I think they do, otherwise where would the main thread be pinned?), the OS scheduler should be able to match a thread with a free core rather than with a busy one.
Compute class threads are pinned, one thread per hardware thread (you can override this).
I/O class threads are not pinned (0 to n I/O class threads permitted).
Tasks can be scheduled:
without regard to affinity
with regard to affinity
with proximity to the current thread
exclusive of proximity to the current thread
to a specific thread
distributed amongst caches, at different levels
to I/O class threads
The tasks can further be classified as required to meet the above criteria or preferred to meet the above criteria, and scheduling can be based upon the availability of threads in LIFO, FIFO, or completion-task order.
Tasks are not pinned (not in the pthread sense); the threads in the thread pool are pinned, and the tasks are scheduled as classified above (or not classified, if you do not care).
All this can be done by placing a simple token on the template. This is not a complex method of programming; a first-year CS student can master this technique.
On the MTL (Manycore Testing Lab), 4P x 8 cores x 2 SMT (small samples):
parallel_for( ...); // all threads
parallel_for( OneEach_L3$, ...); // one slice per socket
parallel_for( L3$, ...); // all threads in current socket
parallel_for( Waiting$ + L3$, ...); // current thread + any waiting threads in this socket
parallel_for( NotMyCacheLevel$ + L3$, ...); // socket with most available threads at L3$
where ... are the remaining arguments to the template.
Affinity scheduling in the pthread sense requires a tighter programming tolerance of thread starts/stops.
Affinity scheduling on a task basis requires less of that: parallel_task( L3$, ...); would mean (for the MTL system) that any one of the 8 threads in the current socket can take the task. This is much looser scheduling.
I think that most of the opinions expressed in this forum thread have been tempered by the lack of ease of affinity-based programming. The same can be said of my post, from the perspective of having that ease of affinity-based programming.
Jim Dempsey
TBB and Cilk are designed for the opposite case, where a program is composed from multiple black boxes and the resource needs of each box are not exposed.
For my use case, where I "own the machine", OpenMP performs much worse than TBB. The TBB scheduler and the parallel_for partitioning options make TBB the winner over OpenMP.
I just would like to let you know that TBB has great potential for these scenarios, and that you should not exclude these use cases just because they seem to be covered by OpenMP.
The process affinity mask can be obtained at TBB initialization (before the worker threads are created), but if at that time the main thread has already been pinned to a core, that mask will be of no help for the issue. Another way is to set the mask for new threads to include all cores; but ignoring the process mask, despite being the current behavior, is rather bad and I would like to fix it, as there might be reasons for a user to restrict a process to just a subset of cores. So the process mask captured at TBB initialization still seems the best bet; and if the application restricts the affinity of its main thread, that should be done only after TBB has been initialized. E.g. with such a solution implemented in TBB, Matthias would still see the issue in his app, but putting TBB initialization before GotoBLAS initialization would help.
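A sketch of the approach being discussed (illustrative only, not actual TBB code): capture the calling thread's mask once at initialization, and have every freshly created worker apply that captured mask to itself instead of inheriting the creator's current pinning.

#define _GNU_SOURCE
#include <sched.h>

// Captured once when the scheduler is initialized. On Linux this is the
// affinity of the calling thread, so it only reflects the unrestricted
// process mask if the main thread has not been pinned yet at that point.
static cpu_set_t g_init_mask;

void capture_init_mask() {
    CPU_ZERO(&g_init_mask);
    sched_getaffinity(0, sizeof(g_init_mask), &g_init_mask);
}

// Called by each new worker thread right after it starts, so workers run
// with the mask that was in effect at initialization rather than with
// whatever their creator happens to be pinned to now.
void apply_init_mask_to_self() {
    sched_setaffinity(0, sizeof(g_init_mask), &g_init_mask);
}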
