more information after more debugging:
The CPU time and monotonic clock time are the same. I.e. only one core executes the 24 TBB threads. The remaining 23 cores are idle. :(
The best guess at what is happening so far: GotoBLAS, which is statically linked into the final executable, initializes its worker threads and leaves the main thread pinned to one core. The TBB initialization then creates its threads without clearing the CPU pinning of the calling thread, so all worker threads inherit the affinity mask and end up pinned to that same core.
What puzzles me, though, is that I tried the following two things and they didn't help:
Ideas how to make TBB actually use the other 23 cores?
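If the diagnosis above is right, one possible workaround is to reset the calling thread's affinity mask back to all online cores after GotoBLAS initializes but before TBB creates its workers, so the workers inherit the widened mask. A Linux-specific sketch using sched_setaffinity (the GotoBLAS pinning behavior is assumed, not verified):

```cpp
#include <sched.h>    // sched_setaffinity, CPU_ZERO, CPU_SET (Linux)
#include <unistd.h>   // sysconf
#include <cstdio>     // perror

// Widen the calling thread's CPU affinity mask to every online core.
// Returns the number of cores put back in the mask, or -1 on error.
int reset_affinity_to_all_cores() {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (long c = 0; c < ncores; ++c)
        CPU_SET(c, &mask);
    // pid 0 means "the calling thread"; TBB worker threads created
    // afterwards inherit this mask.
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return static_cast<int>(ncores);
}
```

Calling this right before the TBB scheduler is initialized should let the workers spread across all 24 cores, though it does undo whatever pinning GotoBLAS set up for its own threads.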
Agreed. The OS is, in theory, unable to manage threads better than the application, while the application in many cases can manage threads better than the OS.
And, of course, pinning all threads to one core is not a model example of good thread management. You should not draw any conclusions from this example.
I don't agree with this part. There are many applications in HPC that are allowed to use the resources assigned to them however they see fit; i.e. a batch scheduler is used to schedule at the process level.
In HPC, pinning threads to cores can make a lot of sense, since it can increase efficiency. Because GotoBLAS is geared toward HPC applications, it pins its worker threads (and pins the main thread so that it stays away from the workers' cores).
Even in HPC, it sounds plain wrong for a library to assume that it always fully controls the machine. What if the application wants to apply parallelism at a higher level than the BLAS routines (which may well be more beneficial) and thus limit BLAS-level parallelism? A library forcing an application onto a single core is much like the tail wagging the dog.
Keeping the main thread away from the worker cores is something an OS can do (and I believe modern OSes do) very easily. If the worker threads leave at least one core free (which I think they do; otherwise, where would the main thread be pinned?), the OS scheduler should be able to match a thread with a free core rather than with a busy one.
For my use case, where I "own the machine", OpenMP performs much worse than TBB. The TBB scheduler and the partitioning options of parallel_for make TBB the winner over OpenMP.
I just want to let you know that TBB has great potential in these scenarios, and that you should not exclude these use cases just because they seem to be covered by OpenMP.