e4lam
Beginner
135 Views

Slow Linux performance when explicitly initializing task scheduler

Hi,

I've run into a strange performance issue that only reproduces on Linux (gcc 4.4) but is fine on Windows (VC8). My test case runs fine if I do NOT explicitly create a task_scheduler_init object. As soon as I create one in my main(), even with just the default constructor arguments, the test case slows down. Specifically, it looks like some lock is held while running tbb::parallel_for() that prevents it from fully multithreading. On my i7 (4 cores hyperthreaded to 8), CPU usage is always roughly 12%, i.e. about one logical core's worth. If I comment out the explicit task_scheduler_init construction, the test case achieves around 50-60% CPU usage (it multithreads poorly to begin with).

The parallel_for() usage looks like this (where myThreadCount = 8):

static tbb::simple_partitioner part;
tbb::parallel_for(tbb::blocked_range<int>(0, myThreadCount, 1), apply, part);

The problem behaviour is the same with the debug TBB library and, as far as I can tell, the same in both TBB 2.2 and TBB 3.0. Note that the TBB_VERSION=1 output below was taken against a debug build of TBB.

Any pointers?

Thanks,
-Edward

PS. In my other test cases, the behaviour does not occur, possibly due to less memory allocation?

TBB: VERSION 3.0
TBB: INTERFACE VERSION 5000
TBB: BUILD_DATE Mon May 17 15:47:20 UTC 2010
TBB: BUILD_HOST andorra (x86_64)
TBB: BUILD_OS Ubuntu 9.10
TBB: BUILD_KERNEL Linux 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 04:38:19 UTC 2010
TBB: BUILD_GCC gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9)
TBB: BUILD_GLIBC 2.10.1
TBB: BUILD_LD
TBB: BUILD_TARGET intel64 on cc4.4.1_libc2.10.1_kernel2.6.31
TBB: BUILD_COMMAND g++ -DTBB_USE_DEBUG -DDO_ITT_NOTIFY -g -O0 -DUSE_PTHREAD -m64 -fPIC -D__TBB_BUILD=1 -Wall -Wno-parentheses -Wno-non-virtual-dtor -I../../src -I../../src/rml/include -I../../include
TBB: TBB_USE_DEBUG 1
TBB: TBB_USE_ASSERT 1
TBB: DO_ITT_NOTIFY 1
TBB: ALLOCATOR scalable_malloc
TBB: RML private
TBB: SCHEDULER Intel

14 Replies
RafSchietekat
Black Belt
135 Views
Is the task_scheduler_init long-lived? What is myThreadCount doing in that parallel_for?
e4lam
Beginner
135 Views
Sorry, the task_scheduler_init *is* long-lived. I lied when I said it was in main(). It's actually a file-level static.

myThreadCount is 8 and it's there for legacy reasons. So yes, in this case, parallel_for() just gets repeatedly invoked for very simple task management.
e4lam
Beginner
135 Views
After some fortuitous testing, I've discovered that the culprit is fork(). It appears that if an instance of tbb::task_scheduler_init is created and the process then forks (to perform the standard Unix backgrounding trick), the task scheduler seems to spend its time in some serious locking.

Can someone confirm whether this is supported behaviour? For sure, I'm going to work around this in the meantime, but is this something that TBB should handle itself? Thanks!
RafSchietekat
Black Belt
135 Views

"It's actually a file-level static."
Hmm, does that even work, I wonder... I always thought it had to be associated with a particular (user) thread, typically as an automatic variable (on the stack). How does this interact with any (other) dynamic initialisation? But I defer to the TBB team for this one.

"myThreadCount is 8 and it's there for legacy reasons."
Explicit uses of thread counts have often been counterproductive.

e4lam
Beginner
135 Views
The fork() problem is even more serious on OSX, where calling task_scheduler_init::terminate() can crash after forking.
Andrey_Marochko
New Contributor III
135 Views
Where do the problems happen? In the parent or child process?
e4lam
Beginner
135 Views
The problems happen in the child process. (The parent process exits right away)
RafSchietekat
Black Belt
136 Views
"Can someone confirm whether this is supported behaviour?"
Linux fork() only keeps the thread that called fork(), so that's going to throw a wrench in the works all right. Maybe that's the cause of the slowdown, if task_scheduler_init thinks it has already launched workers but they then silently disappear at fork() time and the process becomes single-core? Without an explicit task_scheduler_init, perhaps fork() occurs first, before it can upset things. Just some possibilities, but only you know how your program is constructed. I think that Mac OS X, with its Mach microkernel, also keeps just one thread. Perhaps you can call other functions to keep all the threads in Linux and OS X, if that's what you want to do, but from the theory above you may be able to fork() first and avoid the issue altogether. Let us know!

Andrey_Marochko
New Contributor III
135 Views
Yes, Raf is absolutely correct. The cloned process will have a copy of all user-level data structures, but the worker threads will not be cloned. What happens when TBB tries to wake up or join a nonexistent thread is known only to the pthread and OS kernel developers, but obviously it does not go well.

If you have to use a static task_scheduler_init, you could construct it in deferred mode and do the actual initialization only after forking, as Raf suggested.
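
To make the deferred-initialization advice concrete, here is a rough sketch that does not use TBB at all. DeferredPool, kMaxWorkers and workers_alive_in_child are hypothetical names invented for this illustration; the pool is only a stand-in for a scheduler object whose worker threads are started by an explicit initialize() call. It assumes a POSIX system and a C++11 compiler. Each worker just bumps a heartbeat counter, so the child can count how many workers are really running after fork().

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

constexpr int kMaxWorkers = 8;  // hypothetical upper bound for this sketch

// Hypothetical stand-in for a scheduler with deferred startup (NOT TBB's API).
class DeferredPool {
public:
    DeferredPool() : stop_(false), count_(0) {   // "deferred": no threads yet
        for (auto& b : beats_) b.store(0);
    }
    ~DeferredPool() { terminate(); }

    // Launches n heartbeat threads; does nothing if already initialized.
    void initialize(int n) {
        if (count_ != 0) return;
        count_ = n < kMaxWorkers ? n : kMaxWorkers;
        for (int i = 0; i < count_; ++i)
            threads_.emplace_back([this, i] {
                while (!stop_.load()) {
                    beats_[i].fetch_add(1);
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                }
            });
        // Wait until every worker has produced at least one heartbeat.
        for (int i = 0; i < count_; ++i)
            while (beats_[i].load() == 0) std::this_thread::yield();
    }

    // Stops and joins the workers, returning the pool to its deferred state.
    void terminate() {
        stop_.store(true);
        for (auto& t : threads_) t.join();
        threads_.clear();
        count_ = 0;
        stop_.store(false);
    }

    // Counts workers whose heartbeat advances during a 100 ms window.
    int responsive_workers() const {
        std::array<long, kMaxWorkers> before{};
        for (int i = 0; i < count_; ++i) before[i] = beats_[i].load();
        usleep(100 * 1000);
        int alive = 0;
        for (int i = 0; i < count_; ++i)
            if (beats_[i].load() != before[i]) ++alive;
        return alive;
    }

private:
    std::atomic<bool> stop_;
    int count_;
    std::array<std::atomic<long>, kMaxWorkers> beats_;
    std::vector<std::thread> threads_;
};

// Forks, then reports how many of the n workers are responsive in the child.
// init_before_fork == false models the deferred pattern suggested above.
int workers_alive_in_child(bool init_before_fork, int n) {
    DeferredPool pool;
    if (init_before_fork) pool.initialize(n);
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (!init_before_fork) pool.initialize(n);  // start workers after fork
        // _exit skips destructors on purpose: in the eager case the child's
        // pool still lists threads that no longer exist, and joining them
        // is exactly the kind of operation that goes wrong.
        _exit(pool.responsive_workers());
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this stand-in, workers_alive_in_child(false, 4) should report 4 and workers_alive_in_child(true, 4) should report 0: the child's copy of the pool still believes it owns four workers, but none of them survived the fork. With the real library the equivalent would presumably be constructing task_scheduler_init in deferred mode and calling initialize() in the child, as described above.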
e4lam
Beginner
135 Views
Yes, I had already changed the mode to deferred once I found the crash and it does indeed seem to work. Thanks! It might be worthwhile adding a note to the documentation of task_scheduler_init even though one should know better. :)

I'll test this but I'm slightly afraid of the following event sequence:
- A deferred task_scheduler_init object is created
- The fork happens
- Some parallel work is executed, thereby implicitly creating the default number of threads
- Then the task_scheduler_init object is called to initialize with fewer threads than the default


e4lam
Beginner
135 Views
And as I suspected, the above sequence doesn't behave well. Since the scheduler is reference counted, there seems to be no way to shut down the workers created by the implicitly created scheduler. :(
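
The behaviour described here can be pictured with a toy model. This is only an assumption about how a reference-counted scheduler could behave, based on the observation above; it is not TBB's implementation, and ToyScheduler and replay_sequence are hypothetical names. In the model, the first acquire fixes the worker count, and later acquires only bump the reference count.

```cpp
// Toy model (an assumption, not TBB's code): a reference-counted scheduler
// whose worker count is fixed by whoever acquires it first.
class ToyScheduler {
public:
    ToyScheduler() : refs_(0), workers_(0) {}

    // Returns the worker count actually in effect after this acquire.
    int acquire(int requested) {
        if (refs_ == 0) workers_ = requested;  // only the first acquire sizes the pool
        ++refs_;
        return workers_;
    }

    // Drops one reference; workers go away only with the last one.
    void release() {
        if (refs_ > 0 && --refs_ == 0) workers_ = 0;
    }

    int workers() const { return workers_; }

private:
    int refs_;
    int workers_;
};

// Replays the sequence from the list above and returns the worker count in
// effect after the explicit initialization. If explicit_first is true, the
// explicit initialization happens before any parallel work instead.
int replay_sequence(int default_threads, int requested_threads, bool explicit_first) {
    ToyScheduler sched;
    if (explicit_first) {
        int in_effect = sched.acquire(requested_threads);  // explicit init first
        sched.acquire(default_threads);                    // implicit init by parallel work
        return in_effect;
    }
    sched.acquire(default_threads);            // implicit init by parallel work
    return sched.acquire(requested_threads);   // late explicit init: count ignored
}
```

In this model, replay_sequence(8, 2, false) yields 8, because the implicit reference was taken first and is never released by user code, while replay_sequence(8, 2, true) yields 2. That matches the symptom reported above, but whether the real scheduler works this way internally is not confirmed in this thread.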
Alexey_K_Intel3
Employee
135 Views
Is it possible to "activate" task_scheduler_init object right after the fork, before the first parallel loop starts?
e4lam
Beginner
135 Views
It's certainly possible but I'm working at a fairly low layer of a large application suite that I've been switching to use TBB. I was trying to mimic some preexisting API functionality that allowed the user to dynamically reconfigure the number of threads.