Intel® oneAPI Threading Building Blocks

uneven task distribution kills performance

Matthias_Kretz
New Contributor I
Hi,
I would be grateful for a pointer on how to debug the following:

I used TBB to parallelize a given routine. To verify correctness and speed I developed it in a standalone unit test. After this was ready I integrated it into the original application, but now the performance gain is much smaller than in my test program. Since the task is to move ~1 GB and more of data around in memory, I think I'm safe to assume that I'm not seeing any cache effects. Also, the program it is integrated into is not using multithreading at this point of execution, so there is no oversubscription.

So far my debugging turned up that in my test scenario the 24 TBB threads each process 23-72 tasks out of a total of 1024. In the application the distribution is 1-175 tasks per thread (11 threads processed just a single task).
Important note: I'm using parallel_for with the simple_partitioner, so all tasks are of comparable size.

Any idea what is going wrong in my application? Can the TBB threads take too long to wake up? Any ideas how to improve the performance?
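
For context, the loop in question looks roughly like this (a minimal sketch; move_data, the body, and the grainsize choice are hypothetical stand-ins for the real code):

#include <cstring>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>

// Sketch: split the copy into ~1024 chunks of (almost) equal size.
// simple_partitioner never coalesces chunks, so every task covers a
// comparable amount of data.
void move_data(char* dst, const char* src, std::size_t bytes) {
    const std::size_t grain = bytes / 1024;          // hypothetical: ~1024 tasks
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, bytes, grain),
        [=](const tbb::blocked_range<std::size_t>& r) {
            std::memcpy(dst + r.begin(), src + r.begin(), r.size());
        },
        tbb::simple_partitioner());
}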
Matthias_Kretz
New Contributor I

More information after more debugging:

The CPU time and the wall-clock (monotonic) time are the same, i.e. only one core executes all 24 TBB threads. The remaining 23 cores are idle. :(

The best guess so far at what is happening: GotoBLAS, which is statically linked into the final executable, initializes its worker threads and leaves the main thread pinned to one core. After that, the TBB initialization creates its threads without clearing the CPU pinning of the calling thread, thus pinning all worker threads to the same core.

What puzzles me, though, is that I tried the following two things and they didn't help:

  1. I added a sched_setaffinity call right before calling tbb::task_scheduler_init to clear the CPU pinning, and reset it to the previous value after task_scheduler_init returned
  2. I patched the binary to change the order of initialization (i.e. first tbb::task_scheduler_init, then gotoblas_init)

Ideas how to make TBB actually use the other 23 cores?

Matthias_Kretz
New Contributor I
Got it working now by adding a fake tbb::parallel_for right after task_scheduler_init, with sched_setaffinity clearing the pinning for the two calls and setting it back afterwards.

I suggest TBB become smart enough to handle a pinned main thread and remove the pinning for all spawned worker threads.
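
In case it helps someone else, here is a minimal sketch of the workaround (Linux-specific; compile with -D_GNU_SOURCE for the CPU_* macros; the function name is illustrative and error handling is omitted):

#include <sched.h>                     // sched_getaffinity/sched_setaffinity, cpu_set_t
#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for.h>

void init_tbb_unpinned() {
    cpu_set_t saved, all;
    CPU_ZERO(&saved);
    sched_getaffinity(0, sizeof(saved), &saved);   // remember the current pinning
    CPU_ZERO(&all);
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        CPU_SET(cpu, &all);
    sched_setaffinity(0, sizeof(all), &all);       // unpin the calling thread

    static tbb::task_scheduler_init init;          // create the scheduler
    // Fake loop: forces the lazily created worker threads to start now,
    // while the cleared mask is in effect, so they inherit it.
    tbb::parallel_for(0, 1024, [](int) {});

    sched_setaffinity(0, sizeof(saved), &saved);   // restore the previous pinning
}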
Alexey-Kukanov
Employee
TBB threads are created lazily since the latest 3.0 update, i.e. only when work arrives. This is why the fake parallel_for loop helped, while the placement of init calls etc. did not.

We are aware of the issue, and will address it in a future update. Meanwhile, this case just confirms that thread pinning is evil.
jimdempseyatthecove
Honored Contributor III
>>this case just confirms that thread pinning is evil.

How can you say that?

Bad thread pinning is "evil".

The "fault" here is lack of documentation when using multiple multi-threaded libraries and how to construct work arounds.

Jim Dempsey
Dmitry_Vyukov
Valued Contributor I
>>this case just confirms that thread pinning is evil.

How can you say that?

Bad thread pinning is "evil".

Agreed. The OS is theoretically unable to manage threads better than the application, while the application in many cases can manage threads better than the OS.

And, of course, pinning all threads to one core is not a model example of good thread management. You should not draw any conclusions based on this example.

Alexey-Kukanov
Employee
This is only true when the application developer controls everything and knows how to do things right, which is a rare case in modern programming. Usually, programs are composed from components/libraries developed by different people, and component/library developers might not know what is right for the application and what is not. Moreover, modern computing environments assume resource sharing between applications and/or users - so actually the OS may know better how to assign threads.

So ANY good application should be designed to be cooperative with regard to resource usage. Direct manipulation of thread-to-core affinity is not, and cannot be, cooperative. Ideally, an application should hint to the OS how to assign its threads, but should not force a particular pinning.
Dmitry_Vyukov
Valued Contributor I
Then you should say that thread pinning is evil for developers who do not control their programs and do not know how to do things right. I have no objections to that statement :)
Matthias_Kretz
New Contributor I
So ANY good application should be designed to be cooperative with regard to resource usage. Direct manipulation of thread-to-core affinity is not, and cannot be, cooperative. Ideally, an application should hint to the OS how to assign its threads, but should not force a particular pinning.


I don't agree with this part. There are many applications in HPC that are allowed to fully use the resources they were assigned as they see fit. I.e. a batch scheduler is used to schedule the processes.
In HPC, pinning the threads to cores can make a lot of sense, as it can increase efficiency. Since GotoBLAS is geared toward HPC applications, it pins its worker threads (and the main thread, to keep it away from the workers' cores).

ARCH_R_Intel
Employee
It comes down to whether your program "owns the machine" or not. Pinning makes sense when a program owns the machine and the programmer can accurately orchestrate resource usage across the system. I've heard some shops reboot the machine before a run just to be sure nothing else is running on it. OpenMP is a good fit for "own the machine" scenarios.

TBB and Cilk are designed for the opposite case, where a program is composed from multiple black boxes, and the resource needs of each box are not exposed.

Like Alexey said in an earlier note, we're working on making TBB behave more sensibly.

For a short-term workaround, task_scheduler_observer can be used to unpin the worker threads. Create one with an on_scheduler_entry method that unpins the current thread.
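
A sketch of that observer (assuming Linux; the class name is illustrative):

#include <sched.h>                     // compile with -D_GNU_SOURCE for the CPU_* macros
#include <tbb/task_scheduler_observer.h>

// Unpins every thread that enters the TBB scheduler, undoing any
// affinity mask inherited from the (pinned) creator thread.
class UnpinObserver : public tbb::task_scheduler_observer {
public:
    UnpinObserver() { observe(true); }             // start observing
    virtual void on_scheduler_entry(bool /*is_worker*/) {
        cpu_set_t all;
        CPU_ZERO(&all);
        for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
            CPU_SET(cpu, &all);
        sched_setaffinity(0, sizeof(all), &all);   // allow this thread on any core
    }
};

Create the observer before the first parallel algorithm runs, so every worker gets unpinned as it joins the scheduler.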
Alexey-Kukanov
Employee
C'mon Dmitry, you never control every single bit of your program; you use tools and libraries. And you trust that the behavior of the pieces out of your control is sane. If you find something insane, and you can't change or influence it, you implement workarounds. So do I, and so likely does every programmer.

There are at least two things in the described affinity case that appear insane. First, it's insane for a new thread to inherit affinity from its creator thread (the sane behavior would be to inherit the affinity of the process). Now I am aware of this bad behavior, and a workaround will be implemented in TBB.

The second strange thing is for GotoBLAS to set the affinity of the main thread, as if the library fully controlled everything done there. If this is the default behavior, it's insane; if it cannot be changed, it's even worse than that. A better behavior, if affinity is so desired, would be to apply it on entering a library function and revert it on leaving the function. Now Matthias became aware of that, and has to implement some workarounds. The best thing IMO would be to prevent GotoBLAS from changing the main thread's affinity at all, if possible.

Note that if either of these two behaviors were sane, Matthias wouldn't have the problem, neither with TBB nor with any other threading library.

Finally, when I said "know how to do things right" I meant "do affinity right". And that's far from true for the average programmer, simply because this is low-level system stuff that most people aren't experts in. The times when programmers knew everything about their computers are long gone; so are the times when programs were developed for a single computer configuration (and rewritten each time the configuration changed).
Alexey-Kukanov
Employee
So ANY good application should be designed to be cooperative with regard to resource usage. Direct manipulation of thread-to-core affinity is not, and cannot be, cooperative. Ideally, an application should hint to the OS how to assign its threads, but should not force a particular pinning.


I don't agree with this part. There are many applications in HPC that are allowed to fully use the resources they were assigned as they see fit. I.e. a batch scheduler is used to schedule the processes.
In HPC, pinning the threads to cores can make a lot of sense, as it can increase efficiency. Since GotoBLAS is geared toward HPC applications, it pins its worker threads (and the main thread, to keep it away from the workers' cores).


Even in HPC, it sounds plain wrong for a library to assume that it always fully controls the machine. What if the application wants to apply parallelism at a higher level than the BLAS routines (which may well be more beneficial) and so limit BLAS-level parallelism? A library forcing an application onto a single core is much like the tail wagging the dog.

Keeping the main thread away from the worker cores is something that an OS can do (and I believe modern OSes do) very easily. If the worker threads leave at least one core free (which I think they do, otherwise where would the main thread be pinned?), the OS scheduler should be able to match a thread with a free core rather than with a busy core.

jimdempseyatthecove
Honored Contributor III
The affinity-setting choice I use in QuickThread is a good compromise that gives the programmer the control they need.

Compute-class threads are pinned, one thread per hardware thread (you can override this).
I/O-class threads are not pinned (0 to n I/O-class threads permitted).

Tasks can be scheduled:

without regard to affinity
with regard to affinity
with proximity of current thread
exclusive of proximity of current thread
to specific thread
distributed amongst caches, at different levels
to I/O class threads

The tasks can further be classified as required to meet the above criteria or preferred to meet the above criteria, and can be scheduled based upon the availability of threads in LIFO, FIFO, or completion-task order.

Tasks are not pinned (not in the pthread sense); the threads in the thread pool are pinned, and the tasks are scheduled as classified above (or not classified, if you do not care).

All this can be done by placing a simple token in the template call. This is not a complex method of programming; a first-year CS student can master this technique.

On the MTL (Manycore Testing Lab), 4P x 8 cores x 2 SMT (small samples):

parallel_for( ...); // all threads
parallel_for( OneEach_L3$, ...); // one slice per socket
parallel_for( L3$, ...); // all threads in current socket
parallel_for( Waiting$ + L3$, ...); // current thread + any waiting threads in this socket
parallel_for( NotMyCacheLevel$ + L3$, ...); // socket with most available threads at L3$

where ... are the remaining arguments to the template.

Affinity scheduling in the pthread sense requires a tighter programming tolerance of thread starts/stops.
Affinity scheduling on a task basis requires less programming tolerance: parallel_task( L3$, ...); would mean (for the MTL system) that any one of the 8 threads in the current socket can take the task. This is much looser scheduling.

I think most of the opinions expressed in this forum thread have been tempered by the lack of ease of affinity-based programming. The same can be said of my post, from the perspective of having that ease of affinity-based programming.

Jim Dempsey


Matthias_Kretz
New Contributor I
OpenMP is a good fit for "own the machine" scenarios.

TBB and Cilk are designed for the opposite case, where a program is composed from multiple black boxes, and the resource needs of each box are not exposed.


For my use case, where I "own the machine", OpenMP performs much worse than TBB. The TBB scheduler and the parallel_for partitioning options make TBB the winner over OpenMP.

I just would like to let you know that TBB has great potential for these scenarios, and that you should not exclude these use cases because they seem covered by OpenMP.

Alexey-Kukanov
Employee
It seems the Linux kernel makes no distinction between the affinity of a process and the affinity of its main thread. Too bad, because it makes the solution I would like to see implemented in TBB more restrictive.

The process affinity mask can be obtained at TBB initialization (before worker threads are created), but if by that time the main thread has already been pinned to a core, the mask will be of no help for this issue. Another way is to set the mask for new threads to include all cores; but ignoring the process mask, despite being the current behavior, is rather bad and I would like to fix it, as there might be reasons for a user to restrict a process to just a subset of cores. So the process mask captured at TBB initialization still seems the best bet; and if the application restricts the affinity of its main thread, that should be done only after TBB is initialized. E.g. with such a solution implemented in TBB, Matthias would still see the issue in his app, but putting TBB initialization before GotoBLAS initialization would help.
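
To illustrate the first point, a small sketch (Linux; compile with -D_GNU_SOURCE): querying "process" affinity by PID really returns the main thread's mask, so once the main thread is pinned, the original process-wide mask cannot be recovered.

#include <sched.h>
#include <unistd.h>
#include <cstdio>

int main() {
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(0, &one);
    sched_setaffinity(0, sizeof(one), &one);              // pin the main thread to core 0

    cpu_set_t byPid;
    CPU_ZERO(&byPid);
    sched_getaffinity(getpid(), sizeof(byPid), &byPid);   // query "process" affinity by PID
    std::printf("CPUs in the PID-queried mask: %d\n", CPU_COUNT(&byPid));  // prints 1
    return 0;
}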